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The  explosive  growth  of  today's  data  requires  new  data  mining  techniques  that 
scale  up  well  with  large,  dynamically  growing,  sparse,  and  high-dimensional  data. 
This  research  presents  algorithms  and  software  for  efficient  mining  of  data  with  such 
characteristics.  First,  two  density-based  algorithms  are  introduced  to  solve  the  prob- 
lem of  identifying  meaningful  subspaces  of  a  high-dimensional  data  space.  The  SubAs- 
soc  algorithm  works  in  a  bottom-up  fashion  to  exploit  the  monotonicity  of  the  density 
criterion  with  respect  to  dimensionality  to  prune  the  search  space.  The  DataScan  al- 
gorithm is  a  simplified  version  of  SubAssoc,  which  is  restricted  to  find  only  large 
2-itemsets  during  a  scan  of  the  data. 

An  efficient  method  called  ICE  that  employs  tree-based  sampling  techniques 
is  presented  for  incremental  classification  on  large,  dynamically  growing  data.  This 
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method  is  based  on  weighted  samples  and  is  independent  of  data  distribution.  The 
basic  idea  is  to  represent  the  class  distribution  in  the  dataset  by  using  the  weighted 
samples  which  are  extracted  from  the  nodes  of  intermediate  decision  trees  using  veri- 
ous  sampling  techniques.  The  weighted  samples  from  the  intermediate  classifier  built 
only  on  the  incremental  data  are  combined  with  the  previously  generated  samples  to 
obtain  an  up-to-date  classifier  for  the  entire  data  in  an  efficient,  incremental  fashion. 

We  present  an  advanced  data  server  that  facilitates  fast  and  efficient  access  to 
distributed  data  to  speed  up  data  mining  operations.  The  server  implements  two 
data  models:  the  streaming  mode  which  guarantees  the  fastest  access  to  the  entire 
data,  and  the  selective  mode  that  allows  users  to  access  part  of  the  data  on  demand. 
The  server  provides  an  easy  programming  interface,  and  its  performance  is  evaluated 
in  the  context  of  high  energy  physics  application. 

Finally,  we  introduce  a  software  system  that  appraises  the  websites  as  an  appli- 
cation of  the  subspace  identification  and  incremental  classification  techniques.  The 
Internet  is  one  of  the  areas  that  grow  very  quickly  and  change  continuously.  The 
classified  objects  today  may  not  have  the  same  classification  any  more  after  a  few 
days.  Therefore,  a  new  classification  technique  that  develops  prototypes  quickly  and 
deals  efficiently  with  incremental  data  is  vital  in  these  applications. 
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CHAPTER  1 
INTRODUCTION 


Data  mining  has  been  attracting  tremendous  amount  of  attention  in  the  data 
research  community.  Due  to  the  advances  in  data  storage  technologies  and  advent  of 
large  scale  business  and  scientific  applications,  the  data  for  data  mining  applications 
have  become  very  large  and  may  contain  over  several  millions  of  records.  Each 
record  typically  consists  of  tens  to  hundreds  of  attributes,  sometimes  of  thousands 
for  some  applications  [1].  The  explosive  growth  of  the  Internet,  where  lots  of  data 
mining  operations  are  performed  on  a  daily  basis  in  order  to  make  quick  decisions 
and  hence  acquire  a  competitive  edge  in  the  respective  business,  also  contributes 
to  the  generation  of  data  with  such  characteristics.  As  a  result,  the  enormity  and 
complexity  of  the  data  involved  in  these  applications  make  the  task  of  data  mining 
prohibitively  expensive.  Traditional  data  mining  techniques  falter  when  they  are 
applied  to  the  data  with  these  characteristics.  Therefore,  it  is  imperative  to  develop 
new  data  mining  techniques  that  overcome  the  difficulties  embedded  in  today's  data. 

One  of  the  important  characteristics  of  the  data  mining  problem  is  the  demand 
for  handling  large  volumes  of  ever-growing  data.  The  patterns  discovered  from  a 
dataset  only  reflect  the  current  state  of  the  dataset.  In  reality,  however,  the  data 
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continuously  changes,  making  the  previously  identified  patterns  obsolete  and  intro- 
ducing new  patterns.  This  task  becomes  nontrivial  when  the  volume  of  data  is  very 
large.  Therefore,  in  order  to  keep  the  discovered  patterns  reliable  and  up-to-date,  it 
is  essential  that  efficient  techniques  be  developed  to  update,  maintain  and  manage 
the  discovered  patterns  from  a  large  volume  of  ever-growing  data. 

Another  important  characteristic  of  modern  data  is  its  high  dimensionality.  Tra- 
ditional data  mining  techniques  consider  every  dimension  to  discover  knowledge  from 
the  data.  In  many  times,  however,  a  large  part  of  the  dimensions  seldom  contributes 
to  the  interesting  knowledge  hidden  in  the  data.  The  difficulty  lies  in  the  fact  that 
the  data  space  is  very  sparse  because  of  high  dimensionality  even  if  the  number  of 
instances  is  large.  There  will  be  huge  improvement  in  performance  of  data  min- 
ing operations  if  these  uninteresting  dimensions  can  be  ignored  during  the  process. 
Therefore,  it  is  desirable  to  identify  the  most  meaningful  and  interesting  dimensions 
from  the  high-dimensional  data  to  facilitate  efficient  data  mining.  We  call  this  prob- 
lem the  identification  of  meaningful  subspaces.  The  discovered  subspaces  can  result 
in  a  more  compact  representation  of  data,  and  help  human  beings  understand  the 
complex  structure  of  the  data  more  easily  [2,  3].  In  addition,  the  data  mining  opera- 
tions can  be  performed  more  efficiently  when  applied  only  to  these  subspaces  rather 
than  to  the  entire  data  space. 

The  development  of  Internet  and  network  technologies  have  made  it  possible 
for  large  business  enterprises  and  scientific  institutes  to  store  the  data  where  it  is 
generated.  Traditional  centralized  processing  would  require  broad  network  bandwidth 
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to  bring  the  geographically  dispersed  data  into  one  place.  Furthermore,  the  entire 
data  may  be  too  large  to  gather  in  one  place.  Data  warehousing  takes  an  eager 
approach  by  which  the  data  from  local  sources  is  preprocessed,  aggregated  and  stored 
in  a  central  repository  [4].  However,  the  granularity  of  data  for  data  mining  is  very 
fine,  which  makes  it  difficult  to  deal  with  aggregates.  A  good  example  includes 
transactional  market  basket  data  [5,  6].  Therefore,  it  is  necessary  to  develop  efficient 
data  access  methods  that  support  centralized  data  processing  operations  by  providing 
quick  access  to  the  distributed  data. 

A  plausible  solution  to  each  of  these  problems  stated  above  is  addressed  in 
this  dissertation.  The  solutions  include  the  subspace  identification  algorithms  for 
mining  sparse,  high-dimensional  data  (Chapter  3),  an  efficient  method  for  incremental 
classification  for  large,  dynamically  growing  data  (Chapter  4),  and  an  advanced  data 
server  with  two  data  access  models  (Chapter  5).  An  automatic  Website  rating  system 
is  presented  as  an  application  of  the  proposed  solutions  to  the  problems  of  large, 
dynamically  growing,  and  high-dimensional  data  (Chapter  6).  A  brief  introduction 
to  classification  and  the  related  background  are  given  in  Chapter  2. 

In  the  rest  of  this  chapter,  each  solution  will  be  summarized,  and  the  chapter 
concludes  with  the  contributions  of  this  research. 

Subspace  identification  We  take  a  density-based  approach  [7]  to  identification 
of  meaningful  subspaces  of  high-dimensional  data  [8].  To  approximate  the  density 
of  the  data  points,  each  dimension  of  the  data  space  is  partitioned  into  equal-length 
intervals.  This  means  that  each  unit  has  the  same  volume,  and  therefore  the  number 


4 


of  points  inside  each  cell  of  all  subspaces  can  be  used  to  approximate  the  density  of 
the  cells. 

Two  algorithms  are  proposed  for  subspace  identification:  one  using  an  associa- 
tion rule  mining  (called  SubAssoc)  and  the  other  a  variation  of  the  SubAssoc  (called 
DataScan).  The  SubAssoc  algorithm  works  in  a  bottom-up  fashion  to  exploit  the 
monotonicity  of  the  density  criterion  with  respect  to  dimensionality  to  prune  the 
search  space,  which  is  similar  to  the  Apriori  algorithm  [5].  The  DataScan  algorithm  is 
a  simplified  version  of  the  SubAssoc  algorithm  which  is  restricted  to  find  only  large 
2-itemsets  during  a  scan  of  the  dataset. 

We  discuss  how  to  select  an  optimal  subspace  out  of  a  set  of  many  meaningful 
subspaces  identified.  In  order  to  eliminate  the  support  count  (density)  parameter, 
two  alternatives  are  suggested  as  future  research.  One  is  to  apply  a  hypergraph 
partitioning  algorithm  [9]  such  that  a  hypergraph  is  formed  from  the  large  itemsets 
found  by  an  association  rule  algorithm.  After  the  hypergraph  partitioning  algorithm 
is  applied  to  the  hypergraph,  the  set  of  vertices  with  the  best  connectivity  is  chosen  as 
the  most  meaningful  subspace.  The  ^-squared  test  [10]  can  be  applied  to  wide  range 
of  problems  involving  correlations.  Basically,  the  significance  of  a  subspace  may  be 
interpreted  as  how  much  the  composing  attributes  are  correlated  one  another. 

The  efficiency  and  effectiveness  of  the  proposed  subspace  identification  algo- 
rithms are  empirically  proved  on  two  popular  data  mining  operations:  classification 
and  clustering.  The  experiments  also  show  that  the  algorithms  are  scalable  in  terms 
of  both  the  dimensionality  and  the  size  of  the  data. 
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Incremental  classification  The  problem  of  large,  ever-growing,  and  distributed 
data  is  addressed  in  the  context  of  classification  in  this  research.  We  introduce 
an  efficient  method  called  ICE  for  incremental  classification  on  large  ever-growing 
data  that  employs  tree-based  sampling  techniques  [11].  The  features  of  ICE  include 
scalability,  easy  parallelization,  inherent  migration  into  distributed  environment,  and 
flexibility.  The  sampling  techniques  are  integrated  with  clustering  operation  based  on 
weighted  sampling.  The  basic  idea  is  to  represent  the  class  distribution  in  the  dataset 
by  using  the  weighted  samples  which  are  extracted  from  the  nodes  of  intermediate 
decision  trees  using  a  clustering  technique.  Since  the  data  distribution  is  implicitly 
reflected  into  the  structure  of  the  decision  trees  from  which  the  weighted  samples  are 
extracted,  the  proposed  method  is  independent  of  data  distribution.  Therefore,  the 
proposed  method  can  build  the  decision  tree  classifiers  of  comparable  quality  for  large, 
ever-growing  data  in  an  incremental  fashion.  The  experimental  results  indicate  that 
the  method  combined  with  the  sampling  techniques  outperforms  random  sampling 
and  promises  to  be  a  basis  of  incremental  classification  for  large,  ever-growing  data. 

Data  models  for  efficient  data  mining  Data  mining  applications  usually  deal 
with  large,  distributed  data,  and  require  an  efficient  data  access  method.  We  present 
two  data  models  for  efficient  data  mining  operations,  and  introduce  an  advanced 
data  server  called  ASWAN  that  facilitates  fast  and  efficient  access  to  distributed 
data  [12,  13].  The  server  implements  the  two  data  access  models:  the  streaming 
access  mode  which  guarantees  the  fastest  access  to  the  entire  data,  and  the  selective 
access  mode  that  allows  users  to  access  part  of  the  data  on  demand  very  quickly.  The 
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server  provides  an  easy  programming  interface,  and  its  performance  is  evaluated  in 
the  context  of  high  energy  physics  applications. 

Automatic  Website  rating  system  The  Internet  is  one  of  the  areas  that  grow 
very  quickly  and  change  continuously.  The  classified  objects  today  may  not  have  the 
same  classification  any  more  after  a  few  days.  In  these  applications,  it  is  essential  to 
develop  a  new  technique  that  is  capable  of  generating  prototypes  quickly  and  dealing 
efficiently  with  incremental  data.  As  an  application  of  the  subspace  identification  and 
incremental  classification  techniques,  we  introduce  a  software  system  that  automat- 
ically rates  Websites  based  on  the  data  collected  by  an  intelligent  Web  agent.  The 
system  has  been  successfully  applied  to  a  practical  Internet  business  whose  major 
operation  is  to  appraise  the  Websites  and  assign  an  appropriate  classification. 

Contributions  This  research  addresses  the  problems  involved  with  the  char- 
acteristics of  modern  data,  and  combines  various  data  mining  operations  for  more 
efficient  data  mining.  In  summary,  the  contributions  of  this  research  include  the 
following: 

•  It  provides  efficient  algorithms  that  identify  the  meaningful  subspaces  of  high- 
dimensional  data,  on  which  the  target  data  mining  operations  can  be  performed 
more  effectively. 

•  The  discovered  subspaces  can  result  in  a  more  compact  representation  of  the 
data  and  simplify  the  data  collection  process. 
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•  It  introduces  an  efficient  metliod  to  incrementally  classify  large,  ever-growing 
data  using  the  tree-based  sampling  techniques. 

•  It  presents  two  data  models  and  an  implementation  that  facilitate  efficient 
access  methods  to  large,  distributed  data  for  data  mining. 

•  It  presents  the  first  attempt  to  automatic  appraisal  of  the  Websites  based  on 
all  aspects  that  are  collected  by  an  intelligent  Web  agent. 


CHAPTER  2 
BACKGROUND 


A  brief  introduction  to  classification  is  presented  in  this  chapter.  The  problems 
of  high-dimensional  data  and  its  previous  work  are  given  in  the  next.  Various  methods 
for  classifying  large  data  are  also  described. 

2.1    Introduction  to  Classification 

Classification  is  a  process  of  generating  a  concise  description  or  model  for  each 
class  of  the  training  dataset  in  terms  of  the  predictor  attributes  defined  in  the  dataset. 
Each  record  of  the  training  dataset  consists  of  a  set  of  predictor  attributes  and  is 
tagged  with  a  class  label.  Attributes  with  discrete  domains  are  referred  to  as  cate- 
gorical, whereas  those  with  ordered  domains  are  referred  to  as  numerical.  The  model 
may  be  subsequently  tested  with  a  test  dataset  for  verification  purpose.  The  model 
is  then  used  to  classify  future  records  whose  classes  are  unknown. 

Among  the  classification  models  in  the  literature  [14],  decision  trees  are  partic- 
ularly suited  for  data  mining  for  the  following  reasons.  The  decision  tree  is  easily 
interpreted  and  comprehended  by  human  beings,  and  can  be  constructed  relatively 
fast  [15].  While  training  neural  networks  is  very  expensive,  inducing  decision  tree  is 
efl^cient  and  thus  suitable  for  large  datasets.  Also,  decision  tree  generation  algorithms 
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do  not  require  prior  knowledge  of  domains  or  distributions  on  the  data.  Finally,  de- 
cision trees  result  in  good  classification  accuracy  compared  to  the  other  models  [16]. 
For  these  reasons,  we  will  focus  on  the  decision  tree  model  in  this  paper. 

Derivation  of  decision  tree  classifier  typically  consists  of  a  construction  phase 
and  a  pruning  phase.  In  the  construction  phase,  a  decision  tree  algorithm  recursively 
partitions  the  training  dataset  until  the  halting  criterion-each  partition  consists  en- 
tirely or  dominantly  of  records  from  one  class-is  satisfied  [15,  17,  18,  19].  Each 
non-leaf  node  of  the  tree  contains  a  splitting  condition  which  is  a  test  on  one  or  more 
attributes  and  determines  how  the  data  is  partitioned.  A  split  selection  method  such 
as  gini  index  [15]  is  computed  in  order  to  discover  the  splitting  condition  at  each 
node.  A  partition  which  meets  the  halting  criterion  is  not  divided  further,  and  the 
node  that  represents  the  partition  is  labeled  with  the  dominant  class. 

In  general,  the  decision  tree  generated  in  the  construction  phase  is  perfect  for 
the  known  records  and  may  be  overly  sensitive  to  statistical  irregularities  and  id- 
iosyncrasies of  the  training  dataset.  Thus,  most  algorithms  perform  a  pruning  phase, 
in  which  nodes  are  iteratively  pruned  to  prevent  overfitting  and  to  obtain  a  tree  with 
higher  accuracy  for  unlabeled  records.  An  algorithm  based  on  the  minimum  descrip- 
tion length  (MDL)  principle  [20,  21]  is  used  in  this  research  to  prune  the  decision 
tree.  In  addition,  since  the  pruning  phase  can  be  generally  executed  in-memory  and 
its  cost  is  very  small  compared  to  that  of  the  construction  phase,  we  restrict  our 
attention  only  to  the  construction  phase  in  this  paper. 
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Figure  2.1  shows  a  sample  decision  tree  classifier  for  the  given  training  dataset. 
Each  record  describes  a  person,  which  is  tagged  with  one  of  the  two  class  labels:  CI 
and  C2.  Each  record  is  characterized  by  two  attributes,  age  and  profession,  the  former 
being  numerical  and  the  latter  being  categorical  with  domain  {  clerk,  business,  teacher, 
professor  }  .  The  splitting  conditions  {age  <  35)  and  {profession  =  clerk)  partition 
the  records  into  the  corresponding  classes.  The  goal  of  building  the  classifier  is  to 
discover,  from  the  training  dataset,  concise  and  meaningful  conditions  involving  age 
and  profession  by  which  a  person  is  classified  into  one  of  the  two  classes. 


No 

Age 

Profession 

Class 

1 

24 

Clerk 

CI 

2 

17 

Business 

CI 

3 

22 

Teacher 

CI 

4 

35 

Professor 

CI 

5 

40 

Professor 

C2 

6 

48 

Clerk 

CI 

7 

52 

Teacher 

C2 

8 

46 

Clerk 

CI 

(5,  6.  7,  8) 

C  )  Profession  =  Clerk 


(a)  Training  data  set 


(b)  Decision  tree  classifier  induced 


Figure  2.1.  Example  of  a  decision  tree  classifier 


2.2    Problems  with  High-dimensional.  Sparse  Data 

The  problem  of  high  dimensionality  is  often  tackled  by  requiring  the  user  to 
specify  a  subset  of  the  dimensions  of  the  dataset  [22].   However,  identification  of 
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subspace  by  a  human  being  is  quite  error  prone.  Another  way  to  address  high  dimen- 
sionality is  to  apply  a  dimensionality  reduction  method  to  the  dataset.  The  Kohonen 
self-organizing  feature  map  [23]  is  a  neural  network  based  scheme  that  projects  high 
dimensional  data  into  a  feature  map  of  a  smaller  dimensionality  such  that  the  prox- 
imity relationships  among  input  data  are  preserved.  However,  the  convergence  is  very 
slow  for  high  dimensional  datasets,  depending  upon  initialization. 

The  principal  component  analysis  (PCA)  [24]  or  Karhunen-Loeve  transforma- 
tion [25,  26]  optimally  transform  the  original  data  space  into  a  lower  dimensional 
space  by  forming  dimensions  that  are  linear  combinations  of  the  given  attributes. 
The  new  space  has  the  property  that  distances  between  points  remain  approximately 
the  same  as  before.  PCA  provides  several  guidelines  on  how  to  determine  the  right 
number  of  dimensions  k  for  given  data  based  on  the  proportion  of  variance  explained 
or  the  characteristic  root  of  the  covariance  matrix.  Latent  semantic  indexing  [27]  is 
a  technique  which  is  extensively  used  in  information  retrieval  domain  and  is  similar 
in  nature  to  PCA. 

While  dimensionality  reduction  techniques  may  succeed  in  reducing  the  dimen- 
sionality, they  have  following  shortcomings.  First,  it  is  difficult  to  find  the  right 
number  of  dimensions  k:  a  small  k  may  lose  important  features  of  the  data  whereas 
a  large  k  results  in  too  large  dimensions.  Second,  the  new  dimensions  can  be  difficult 
to  interpret,  making  it  hard  to  understand  in  relation  to  the  original  data  space. 
Finally,  these  techniques  are  not  effective  for  categorical  attributes. 
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Clustering  techniques  have  been  used  by  many  intelligent  software  agents  in  or- 
der to  retrieve,  filter,  and  categorize  high  dimensional  data  such  as  the  web  documents 
available  on  the  World  Wide  Web  [28].  Clustering  is  also  useful  in  extracting  salient 
features  of  related  web  documents  to  automatically  formulate  queries  and  search  for 
other  similar  documents  on  the  Web.  The  extracted  features  are  then  employed  to 
classify  and  categorize  the  documents.  Traditional  clustering  algorithms  either  use  a 
priori  knowledge  of  document  structures  to  define  a  distance  or  similarity  among  the 
documents,  or  use  probablistic  techniques  such  as  Bayesian  classification  [29].  Many 
of  these  traditional  algorithms,  however,  falter  when  the  dimensionality  of  feature 
space  becomes  high.  Moore  et  al.  introduced  two  clustering  algorithms  that  can  ef- 
fectively cluster  the  web  documents,  based  on  generalizations  of  graph  partitioning 
that  do  not  require  pre-specified  ad  hoc  distance  functions,  and  that  are  capable  of 
automatically  discovering  similarities  or  associations  among  these  documents  [28]. 

Entropy-based  feature  ranking  [30]  introduces  an  entropy  measure  for  determin- 
ing the  relative  importance  of  features  using  a  similarity  measure  that  is  based  on 
distance.  The  algorithm  is  based  on  an  observation  that  removing  an  irrelevant  fea- 
ture from  the  feature  set  but  not  so  otherwise.  However,  the  entropy-based  feature 
ranking  can  only  deal  with  small,  low-dimensional  data  because  the  complexity  of 
the  algorithm  is  O(mn^)  where  m  is  the  number  of  features  and  n  is  the  data  size.  In 
addition,  the  algorithm  is  unable  to  handle  data  with  mixed  attributes  (i.e.,  numer- 
ical and  categorical  attributes)  and  requires  discretization  for  numerical  attributes. 
Relief  [31]  is  a  feature  weight  based  algorithm  that  selects  relevant  features  using  a 
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statistical  method.  Relief  can  handle  only  2-class  problems,  and  suffers  as  the  data 
size  and  the  size  of  the  feature  set  increase. 

Projected  clustering  is  a  generalization  of  feature  selection  that  allows  the  se- 
lection of  different  sets  of  dimensions  for  different  subsets  of  the  data.  A  project 
clustering  algorithm  called  PROCLUS  [32]  returns  a  partition  of  the  data  points  into 
clusters,  together  with  the  sets  of  dimensions  on  which  points  in  each  cluster  are 
correlated. 

The  x-squared  test  [10]  for  correlation  from  classical  statistics  has  been  used  for 
measuring  significance  of  relationships.  Brin  et  al.  defines  the  notion  of  interest  using 
the  x-squared  test  to  measure  the  dependence  among  items  in  market  basket  data  [6]. 
The  x-squared  test  by  its  nature  is  particularly  well  suited  for  large  datasets  (whose 
data  space  tends  to  be  more  dense  than  smaller  data  with  the  same  dimensionality) 
because  its  approximation  requires  large  number  of  samples. 

Many  techniques  have  been  proposed  to  scale  up  the  decision  tree  classifiers 
for  large  datasets.  Approximation  techniques  include  discretization  [18,  33],  sam- 
pling [18,  34],  and  the  SPEC  algorithm  [35].  The  tree  induction  algorithm  ID5 
restructures  an  existing  decision  tree  in  a  dynamic  environment  with  an  assumption 
that  the  training  dataset  fits  in-memory  [36].  Utgoff  et  al.  present  a  set  of  restructur- 
ing operations  that  can  be  used  to  derive  a  decision  tree  construction  algorithm  for  a 
dynamically  changing  dataset  while  maintaining  the  optimal  tree  [37].  Meta-learning 
techniques  can  be  used  to  scale  up  the  classification  process  to  large  datasets  while 
achieving  comparable  accuracy  [38]. 
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A  framework  called  Rainforest  is  recently  proposed  for  developing  fast  and  scal- 
able algorithms  for  constructing  decision  trees  that  gracefully  adapt  to  the  amount 
of  main  memory  available  [39].  The  BOAT  framework  adopts  an  optimistic  ap- 
proach to  incrementally  construct  a  decision  tree  from  samples  extracted  by  using 
the  bootstrapping  technique  [40].  With  changes  in  data  distribution,  however,  it  re- 
quires another  pass  over  the  dataset  and  needs  temporary  information  to  incremen- 
tally construct  a  decision  tree.  The  entropy  minimization  heuristic  and  the  minimal 
description  length  (MDL)  principle  are  used  to  discretize  the  range  of  a  continuous- 
valued  attribute  into  multiple  intervals  [33].  An  MDL-based  pruning  algorithm  called 
PUBLIC  interleaves  with  tree  construction  phase  [41]. 

There  have  been  some  efforts  to  use  a  data  mining  operation  for  another  data 
mining  operation.  The  CLIQUE  algorithm  is  a  clustering  algorithm  that  integrates 
association  rule  algorithm  to  find  dense  subspaces  that  otherwise  are  not  dense  in  the 
entire  data  space  [42].  Liu  et  al.  restrict  the  association  rule  algorithm  to  generate 
rules  whose  right-hand  side  is  only  allowed  to  have  a  class  label  [43].  The  generated 
rules  are  then  pruned,  on  which  a  classifier  is  built.  Another  approach  is  to  use  an 
association  rule  algorithm  to  generate  the  classification  rules  with  high  confidence 
(e.g.,  >  90%)  [44].  However,  none  of  these  works  is  for  more  than  one  data  mining 
operation.  Our  work  differs  from  them  in  that  the  subspaces  can  not  only  be  found 
efficiently  but  also  make  the  results  of  the  target  mining  operation  better. 
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2.3    Problems  with  Large.  Growing  Data 

Most  algorithms  in  the  machine  learning  and  statistics  are  main  memory  algo- 
rithms whereas  today's  databases  are  typically  much  larger  than  main  memory  [45]. 
The  construction  phase  of  the  various  decision  tree  classifiers  differs  in  selecting  the 
test  criterion  for  partitioning  a  set  of  records.  IDS  [46]  and  C4.5  [18]  examine  the 
solution  space  of  all  possible  decision  trees  to  a  fixed  depth,  and  employ  a  simple 
scheme  that  selects  a  test  which  minimizes  the  impurity  of  the  partition.  On  the 
other  hand,  CART  [15],  SLIQ  [47],  and  SPRINT  [19]  choose  the  test  with  the  lowest 
gin!  index.  SPRINT  is  an  exact  technique  that  scales  up  the  decision  tree  for  disk- 
resident  data.  The  algorithm  employs  a  pre-sorting  technique  to  avoid  the  sorting 
operation  at  every  node  of  the  tree  using  a  hash  table  during  the  construction  phase. 
A  recently  proposed  algorithm  named  CLOUDS  samples  the  splitting  points  for  nu- 
merical attributes  followed  by  an  estimation  step  to  narrow  the  search  space  of  the 
best  split,  reducing  cost  substantially  while  maintaining  the  quality  of  the  generated 
trees  [17]. 

Discretization  and  sampling  are  approximate  techniques  that  can  be  used  to 
scale  up  the  classifier  for  large  datasets.  Random  sampling  draws  a  subset  of  the 
entire  dataset  randomly  to  build  a  classifier.  C4.5  uses  the  w;mdou;m^  technique  which 
repeatedly  augments  the  samples  by  adding  misclassified  records  [18].  Stratification  is 
a  selective  sampling  method  in  which  samples  are  drawn  within  the  same  classes  [34]. 
The  sampling  techniques  proposed  by  AlSabti  evaluate  the  gini  index  at  only  a  subset 
of  points  along  each  numerical  attribute  [48]. 
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Discretization  is  a  process  of  transforming  a  numerical  attribute  into  an  order 
discrete  attribute  with  a  small  number  of  values,  by  which  the  sorting  operation  is 
avoided  in  the  construction  phase.  All  discretization  methods  for  classification  that 
take  the  class  label  into  account  when  discretizing  assume  that  the  database  fits  into 
main  memory  [18,  33].  The  static  discretization  techniques  perform  the  discretization 
prior  to  building  the  tree  [34].  The  SPEC  algorithm  is  an  approximation  technique 
designed  to  deal  with  disk-resident  data  [35].  It  uses  a  clustering  technique  to  derive 
the  splitting  point  of  the  numerical  attributes. 

The  tree  induction  algorithm  ID5  restructures  an  existing  decision  tree  in  a 
dynamic  environment  [36].  Utgoff  et  al. proposed  a  set  of  restructuring  operations  that 
derive  a  decision  tree  construction  algorithm  for  a  dynamically  changing  dataset  [37]. 
However,  they  also  assume  that  the  dataset  fits  in-memory.  Meta-learning  techniques 
can  be  used  to  scale  up  the  classification  process  to  large  datasets  while  achieving 
comparable  accuracy  [38].  The  techniques  combine  a  set  of  (base)  classifiers  to  form 
a  global  classifier. 

A  framework  called  Rainforest  is  recently  proposed  for  developing  fast  and  scal- 
able algorithms  for  constructing  decision  trees  that  gracefully  adapt  to  the  amount 
of  main  memory  available  [39].  The  BOAT  framework  optimistically  constructs  the 
exactly  same  decision  tree  from  samples  extracted  by  using  the  bootstrapping  tech- 
nique [40].  With  changes  in  data  distribution,  however,  BOAT  requires  another  pass 
over  the  dataset  and  needs  to  keep  temporary  information  to  incrementally  construct 
the  tree. 
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Decision  tree  classifiers  have  different  properties,  which  are  used  to  compare 
classifiers.  The  accuracy  of  a  classifier  is  the  primary  metric.  It  measures  the  pre- 
dictive performance  of  the  classifier  and  is  determined  by  the  percentage  of  the  test 
dataset  examples  that  are  correctly  classified  [49].  The  misclassification  metric  has 
been  defined  to  deal  with  different  misclassification  costs.  The  risk  metric  extends 
the  misclassification  metric  to  include  the  gain  from  correctly  classified  cases.  The 
comprehensibility  metric  prefers  classifiers  that  are  easy  to  understand. 


CHAPTER  3 

IDENTIFYING  MEANINGFUL  SUBSPACES  OF  SPARSE  DATA 


3.1  Introduction 

Due  to  the  advances  in  data  storage  technologies,  and  the  advent  of  large  scale 
business  and  scientific  applications,  the  data  for  data  mining  applications  have  be- 
come very  large  and  may  contain  over  several  millions  of  records.  Each  record  typ- 
ically consists  of  tens  to  hundreds  of  attributes^  even  to  thousands  for  some  appli- 
cations [1],  some  of  them  with  large  number  of  distinct  values.  The  rapid  growth 
of  Internet,  where  lots  of  data  mining  operations  are  performed  every  day,  also  con- 
tributes to  the  generation  of  such  data.  The  enormity  and  complexity  of  the  data 
involved  in  these  applications  make  the  task  of  data  mining  computationally  very 
expensive.  Due  to  high  dimensionality,  however,  the  data  space  is  very  sparse  even  if 
the  dataset  is  large,  and  thus  most  of  the  computation  performed  is  likely  to  become 
wasteful. 

This  research  is  motivated  by  an  observation  that  not  all  of  the  attributes  need 
to  be  examined  to  discover  hidden  knowledge  contained  in  a  high-dimensional,  sparse 

'In  this  dissertation,  we  use  the  terms  dimension  and  attribute  interchangeably.  We  also  use  the 
terms  meaningful  and  significant  interchangeably  when  they  are  used  to  describe  the  significance  of 
a  dimension. 
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dataset.  We  also  notice  that  most  of  the  attributes  are  seldom  picked  up  during  the 
process  of  some  data  mining  operations.  For  example,  only  a  few  attributes  are 
chosen  to  split  data  partitions  during  decision  tree  classifier  construction.  Yet,  all 
the  attributes  are  involved  in  computing  the  split  selection  method  at  every  node 
of  the  decision  tree  being  constructed.  However,  only  a  handful  of  them  are  found 
significant  enough  to  be  used  as  splitting  attributes  when  building  a  classifier.  Even 
if  some  insignificant  attributes  are  chosen  as  splitting  attributes,  it  usually  happens 
deep  down  the  decision  tree  where  they  are  likely  to  be  pruned  during  the  pruning 
phase  [20] .  Since  finding  splitting  points  requires  heavy  computation  for  each  of  the 
attributes,  it  is  desirable  to  identify  a  set  of  the  most  important  and  meaningful 
attributes  and  use  only  them  to  build  a  classifier.  By  identifying  and  using  such 
attributes,  the  process  of  classification  can  be  considerably  simplified  and  become 
more  efficient. 

Example  1  Two  preliminary  results  of  subspace  identification  applied  to  clas- 
sification are  shown  in  Figure  3.1.  The  person  dataset  was  obtained  from  the  IBM 
Quest  home  page  [50]  and  the  letter  dataset  from  the  UCI  Repository  [1].  The  plots 
in  the  figure  are  normalized  against  the  results  of  the  original  classifiers  for  which 
the  entire  attributes  are  used  to  build.  The  attributes  used  in  the  experiments  are 
obtained  from  near  the  top  of  the  decision  tree  classifier  built  on  the  entire  attributes. 
Figure  3.1(a)  surprisingly  shows  that  only  2  out  of  9  attributes  are  needed  to  build  a 
classifier  of  better  accuracy.  In  fact,  the  two  attributes  are  the  ones  that  are  involved 
in  the  predicate  function  used  to  generate  the  dataset.  The  other  quality  metrics  are 


(a)  Classification  on  subspaces  of  person  dataset 


Subspace  classification  on  tetter  dataset:  K15-A16-C26-I16 


8  10  12 

Number  ol  significant  attributes  used 


(b)  Classification  on  subspaces  of  letter  dataset 
Figure  3.1.  Examples  of  meaningful  subspaces  applied  to  classification 
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also  better  than  those  of  the  original  classifier.  Similarly,  in  Figure  3.1(b),  only  12 
out  of  16  attributes  are  needed  to  construct  a  classifier  of  similar  accuracy.  Although 
the  results  for  the  letter  dataset  are  less  impressive,  it  is  still  good  enough  to  reveal 
the  importance  of  subspace  identification.  The  fact  that  the  datasets  used  in  this  ex- 
ample are  not  high-dimensional  indicates  the  applicability  of  subspace  identification 
to  low-dimensional  data  as  well.  □ 
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2-dimensional  grid  for  {salary,age}  dataset 
Figure  3.2.  The  clusters  in  subspaces  of  the  original  data  space  [42] 


Example  2  Another  example  of  the  importance  of  subspace  identification  is 
given  in  Figure  3.2,  where  no  2-dimensional  unit  is  dense  and  hence  there  are  no 
clusters  in  the  entire  data  space.  If  a  clustering  technique  is  applied  to  the  entire 
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data  space,  no  clusters  will  be  found.  However,  if  the  points  are  projected  onto  the 
salary  dimension,  there  are  2  dense  clusters  of  points.  □ 

The  subspace  identification  provides  many  benefits.  It  is  useful  in  particular 
when  it  is  difficult  to  determine  what  to  look  at  and  what  to  focus  on  in  the  beginning 
of  a  data  mining  and  knowledge  discovery  process  [51].  One  can  quickly  collect  as 
many  features  from  the  data  source  as  possible,  and  filter  out  the  insignificant  ones 
using  this  technique.  We  give  an  example  of  this  in  Section  3.3.1.  For  existing 
applications,  it  can  be  used  to  discover  the  dominant  features  of  the  data  and  provide 
a  more  compact  representation  for  the  data.  This  will  eventually  solve  the  problem  of 
understandability  [3].  Because  only  a  small  subset  of  attributes  is  used  for  discovering 
knowledge  from  a  given  dataset,  the  overall  cost  will  be  apparently  lower.  It  also 
yields  a  potential  for  handling  large  datasets  since  it  requires  less  data  access.  It 
makes  this  technique  suitable  in  distributed  environment  where  data  are  vertically 
partitioned  in  several  places  [13]. 

The  significance  of  attributes  is  defined  in  terms  of  associations  between  at- 
tributes based  on  co-occurrences:  an  attribute  is  considered  significant  if  the  domain 
of  the  attribute  contains  values  that  frequently  occur  with  values  of  other  attributes. 
This  naturally  leads  us  to  take  a  density-based  approach  [7,  42]  to  identify  such  at- 
tributes. We  propose  in  this  chapter  two  subspace  identification  algorithms  using 
this  approach:  one  that  employs  association  rules  mining  [5,  52]  to  take  advantage 
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of  the  monotonicity  of  density  criterion,  and  the  other  which  is  a  simple,  yet  power- 
ful, variation  of  the  first  algorithm  that  constructs  a  2-dimensional  histogram  of  the 
dataset. 

A  survey  of  related  work  on  high-dimensional  data  is  given  in  Section  2.2.  The 
subspace  identification  algorithms  are  presented  in  Section  3.2.  The  experimental 
results  are  shown  in  Section  3.3  and  this  chapter  concludes  in  Section  3.4. 

3.2    Algorithms  for  Subspace  Identification 

In  this  section,  we  present  two  algorithms  for  subspace  identification,  each  of 
which  finds  a  set  of  the  most  significant  attributes  given  two  parameters:  the  number 
of  intervals  5  and  the  density  count  r.  We  also  provide  an  algorithm  that  automati- 
cally discovers  all  the  meaningful  subspaces  of  different  sizes  given  a  range  of  density 
counts.  This  algorithm  enables  a  user  to  choose  a  subspace  that  best  fits  the  given 
application. 

3.2.1    Subspace  Identification  Using  Association 

The  significance  of  attributes  is  defined  in  terms  of  associations  between  at- 
tributes based  on  co-occurrences:  an  attribute  is  considered  meaningful  if  the  at- 
tribute domain  contains  significantly  large  number  of  values  that  appear  together 
with  values  of  other  attributes.  This  naturally  leads  us  to  take  a  density-based  ap- 
proach [7,  42]  to  identify  such  attributes.  To  approximate  the  density  of  the  data 
points,  each  dimension  is  partitioned  into  the  same  number  of  equal-length  intervals. 
This  means  that  each  unit  region  made  of  the  same  number  of  overlapping  intervals 
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from  different  attributes  has  the  same  volume,  and  therefore  the  number  of  points 
inside  it  can  be  used  to  approximate  the  density  of  the  region.  The  density  criterion 
is  specified  by  the  support^  parameter,  which  comes  with  the  downward-closed  prop- 
erty [5]:  a  region  composed  of  k  intervals  can  be  dense  only  if  all  of  its  subregions 
of  size  k  —  I  are  dense.  This  property  is  used  to  prune  the  search  space  using  a 
bottom-up  approach.  Because  of  this  property,  we  use  a  modified  association  rules 
algorithm  to  find  dense  subspaces. 

Given  a  set  D  of  data  points  with  d  attributes  and  the  two  parameters,  6  for  the 
number  of  intervals  and  r  for  the  density  count,  the  problem  is  to  find  dense  regions  of 
points  in  all  subspaces  of  the  entire  data  space,  which  meet  the  density  criterion.  The 
problem  of  identifying  significant  attributes  from  the  set  of  dense  regions  is  simplified 
by  employing  a  weight  function  that  measures  the  significance  of  an  attribute  by  its 
association  with  other  attributes. 

Algorithm  The  algorithm  of  subspace  identification  using  association  (denoted 
SubAssoc)  consists  of  three  steps  described  below.  We  will  go  into  details  of  these 
steps  in  the  rest  of  this  section. 

1.  Convert  the  dataset  to  replace  values  of  attributes  with  unique  interval  identi- 
fiers 

2.  Run  an  association  rules  mining  algorithm  on  the  converted  dataset  to  discover 
the  associations  among  the  interval  units  (the  dense  units) 

-In  this  chapter,  we  use  the  terms  density  and  support  interchangeably. 
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3.  Identify  a  set  of  significant  attributes  (a  subspace)  from  the  large  itemsets  found 

Step  1:  Data  conversion  For  a  given  number  of  intervals  5,  the  domain  of  each 
attribute  is  divided  into  5  intervals,  each  of  which  is  assigned  a  unique  identifier.  A 
categorical  attribute  with  x  values  is  always  divided  into  x  intervals,  regardless  of 
the  value  of  5.  By  replacing  the  values  of  the  attributes  with  the  unique  interval 
identifiers,  the  numerical  and  categorical  values  are  essentially  treated  the  same  way. 

While  scanning  the  dataset,  each  value  in  a  record  is  substituted  with  the  corre- 
sponding interval  identifier  within  which  the  value  lies.  When  converting  the  dataset 
for  classification,  the  class  labels  are  also  converted  into  c  intervals  where  c  is  the 
number  of  class  labels.  The  total  number  of  intervals  is  then  i  ■  S  +  w  +  c  where  i  is 
the  number  of  numeric  attributes,  and  w  the  total  number  of  values  in  all  categorical 
attributes^.  This  step  requires  one  scan  of  the  dataset. 

Step  2:  Discovery  of  associations  among  intervals  An  association  rules  algorithm  [5, 
52]  is  applied  to  the  converted  dataset  with  a  given  density  count  r.  Each  interval  in 
a  record  is  regarded  as  an  attribute  by  the  association  rules  algorithm.  However,  the 
algorithm  is  executed  with  two  restrictions.  First,  the  attributes  (i.e.,  the  intervals, 
call  it  subattributes)  from  the  same  original  attribute  are  not  considered  together  in 
the  same  itemset  during  the  candidate  generation  since  they  cannot  appear  in  the 
same  record.  Second,  every  candidate  large  itemset  must  include  an  interval  for  the 
class  label  if  the  dataset  is  for  classification.  The  chosen  association  rules  algorithm 
^This  number  may  be  less  if  6  is  greater  than  the  number  of  distinct  values  for  an  integer  attribute. 
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can  execute  faster  under  these  restrictions,  compensating  for  the  cost  incurred  by  the 
increase  in  the  number  of  attributes  to  be  dealt  with. 

For  efficiency,  the  association  rules  algorithm  may  be  further  restricted  to  gener- 
ate large  itemsets  of  size  up  to  k.  Although  most  of  the  time  is  spent  in  finding  large 
2-  or  3-itemsets  [5],  this  restriction  will  save  the  cost  of  this  step  when  the  number  of 
desired  attributes  is  known  in  advance.  The  essence  of  this  step  is  to  determine  the 
value  of  r.  A  too  small  r  will  generate  too  many  large  itemsets  (i.e.,  dense  units), 
while  a  too  large  r  will  result  in  few.  However,  it  is  difficult  to  predict  the  appropriate 
value  of  T  beforehand.  We  will  address  this  issue  in  Section  3.2.3. 

Step  3:  Identification  The  output  of  step  2  will  be  the  large  itemsets  whose 
components  consist  of  the  interval  identifiers,  each  of  which  belongs  to  a  different 
attribute,  the  density  count  of  each  itemset,  and  a  class  label  if  for  classification. 
Once  this  information  is  available,  we  are  to  decide  which  attributes  are  significant.  A 
simple  method  would  be  to  choose  a  large  fc-itemset  with  the  largest  density  count  for 
a  given  k.  There  can  be  many  large  A;-itemsets,  however,  especially  for  classification 
data  [43],  and  each  itemset  represents  only  one  unit  region,  bringing  the  need  of  a 
method  that  accurately  determines  the  significance  of  attributes  in  terms  of  the  entire 
data  space.  Hence,  we  introduce  a  simple  weight  function  Wk{A)  that  measures  the 
significance  of  attribute  A  in  terms  of  the  large  A;-itemsets  found  in  the  step  2. 
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I A  is  the  number  of  large  A:-itemsets  in  which  the  attribute  A  appears,  and  Ca  is 
the  sum  of  density  counts  for  those  large  /c-itemsets.  T  and  N  are  the  total  number 
of  large  /c-itemsets  and  the  total  number  of  records,  respectively.  The  meaning  of 
this  weight  function  is  simple:  the  attribute  that  appears  in  many  large  /c-itemsets  is 
given  more  weight,  and  hence  is  considered  more  important.  This  function  calculates 
the  weight  of  every  attribute  that  appears  in  any  large  /c-itemsets.  Therefore,  a  small 
k  may  be  enough  to  identify  n  significant  attributes  for  k  <^  n.  Given  the  large 
itemsets  from  the  step  2,  a  simple  program  can  calculate  Wk  for  all  attributes  for 
any  given  k.  Then,  the  attributes  are  arranged  in  the  order  of  Wk,  and  the  n  best 
attributes  are  chosen  as  the  most  significant. 

Analysis  Note  that  step  2  can  be  combined  with  step  1  by  using  the  converted 
record  as  soon  as  it  becomes  available,  thereby  saving  a  scan  of  the  dataset.  There- 
fore, this  algorithm  requires  one  extra  scan  of  the  dataset  together  with  the  cost  of 
running  the  chosen  association  rules  algorithm.  The  running  time  of  association  rules 
algorithm  depends  more  on  the  size  and  the  characteristics  of  the  dataset  than  on 
the  number  of  attributes.  In  other  words,  the  number  of  candidates  is  not  propor- 
tional to  the  number  of  attributes.  This  characteristic  makes  this  method  suitable 
for  high-dimensional  datasets.  The  cost  of  this  algorithm  can  be  reduced  by  adopting 
a  high  performance  association  rules  algorithm  such  as  the  CLT  algorithm  [52].  The 
CLT  algorithm  uses  a  lexicographic  trie  to  count  the  itemsets  and  can  outperform 
the  existing  algorithms  such  as  Apriori  [5]  and  dynamic  itemset  counting  [53]  by  more 
than  an  order  of  magnitude.  An  MDL-based  pruning  technique  can  also  be  used  as 
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in  the  CLIQUE  algorithm  [42]  to  speed  up  a  similar  process.  In  this  research,  we  use 
the  CLT  algorithm  to  discover  dense  units. 

3.2.2    Sensible  Alternative:  Interactive  2-D  DataScan 

Based  on  an  observation  made  in  step  2  of  the  SubAssoc  algorithm,  we  propose 
a  simple,  yet  powerful,  variation  that  employs  a  2-dimensional  data  scan  (denoted 
DataScan).  We  observe  that  no  large  n-itemsets  needs  to  be  found  to  get  n  significant 
attributes.  In  other  words,  several  large  /j-itemsets  for  A;  <  n  can  be  combined  to 
get  n  significant  attributes.  The  ordering  of  attributes  based  on  their  significance 
makes  this  possible.  In  addition,  there  may  not  exist  large  n-itemsets  in  the  dataset 
for  certain  n.  Therefore,  we  combine  the  first  two  steps  of  the  SubAssoc  into  one, 
and  limit  the  size  of  the  large  itemsets  to  be  found  to  only  2,  then  scan  the  dataset 
to  build  a  2-dimensional  histogram  for  each  pair  of  intervals  of  different  attributes 
using  a  structure  described  in  Figure  3.3. 

Figure  3.3  shows  the  histogram  counters  for  attribute  Aj  with  i  intervals.  Each 
interval  of  attribute  Aj  additionally  keeps  the  interval  structures  of  the  attributes 
{Ajn}  that  come  lexicographically  after  Aj  for  j  <  m.  Therefore,  the  histogram  will 
be  diagonally  structured  with  attribute  Ai  keeping  counters  for  all  other  attributes 
and  with  attribute  Ad  keeping  none.  For  each  pair  of  values  {vj,Vm)  from  attributes 
Aj  and  Am  that  appear  together  in  a  record,  the  corresponding  interval  identifiers 
are  found  for  the  values.  Using  the  interval  identifier  for  Vj  of  Aj,  a  histogram 
structure  for  attributes  Aj+i  to  A^  is  located.  Then  the  count  of  the  interval  for  Vm 
of  Am  is  incremented  by  one.  In  the  figure,  for  example,  the  values  within  interval 
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2  of  attribute  Aj  appear  in  x  records  together  with  the  values  within  interval  4  of 
attribute  Aj+i.  Each  interval  counter  may  keep  the  counters  for  class  labels  so  that 
class  labels  are  considered  in  identifying  dense  intervals,  if  for  classification. 
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Aiir  j+l 
Attrj+2 


d:  The  number  of  dimensions 


:  The  number  of  intervals 


Figure  3.3.  The  interval  counters  for  attribute  k  with  i  intervals 


Analysis  This  algorithm  requires  only  one  scan  of  the  dataset  to  construct  a 
2-dimensional  histogram  of  the  dataset.  The  primary  advantage  is  that  it  allows 
an  interactive  and  iterative  way  of  identifying  significant  attributes  with  less  cost. 
Once  the  2-dimensional  histogram  is  constructed  for  a  given  number  of  intervals,  one 
can  interactively  select  a  desired  density  count  and  promptly  get  the  corresponding 
significant  attributes  in  the  order  of  weights  W2.  This  process  can  also  be  automated 
to  find  the  best  subspaces  of  different  sizes  as  described  in  Section  3.2.3.  Therefore, 
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this  algorithm  will  not  only  save  the  cost  of  running  an  association  rules  algorithm, 
but  also  simplify  the  overall  process  of  subspace  identification. 

However,  the  amount  of  memory  needed  is  proportional  to  0(5  ■  (P)  where  6  is 
the  number  of  intervals,  and  d  is  the  number  of  dimensions.  One  direct  solution  to 
this  problem  is  to  scan  the  dataset  and  prune  those  intervals  and  attributes  that  are 
not  dense  beforehand.  This  can  considerably  reduce  the  amount  of  memory  needed 
at  the  cost  of  an  additional  scan  of  the  dataset.  Another  solution  is  to  use  a  hash 
table  that  keeps  the  counters  for  the  pairs  of  intervals  that  are  dense  to  prune  the 
unoccupied  pairs. 

3.2.3    Automatic  Discovery  of  All  Meaningful  Subspaces 

In  this  section,  we  present  an  application  of  the  DataScan  algorithm  to  automat- 
ically discover  all  meaningful  subspaces  of  different  sizes.  The  subspace  identification 
algorithms  given  in  the  previous  section  can  find  only  one  subspace  at  a  time  given  a 
density  count  and  a  number  of  intervals.  In  practice,  however,  we  want  to  find  a  sub- 
space  that  can  result  in  the  best  performance  without  loss  of  quality.  In  Figure  3.4, 
we  suggest  an  algorithm  that  iteratively  collects  a  set  of  meaningful  subspaces  of 
different  sizes,  each  of  which  best  represents  the  dataset  with  the  given  number  of 
attributes  regardless  of  the  density  count  t.  The  algorithm  requires  a  minimum  den- 
sity threshold  to  start  with  and  a  density  increment  8^nc-  The  parameters  to  the 
algorithm  may  be  adjusted  accordingly  based  on  the  requirements  of  the  application. 
How  to  choose  the  best  subspace  is  also  discussed  in  this  section. 
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In  the  algorithm,  T  is  an  initially  empty  set  of  meaningful  subspaces  in  which 
will  contain  Ti  for  2  <  i  <  d  for  a  dataset  of  dimensionality  d  when  the  algorithm 
terminates.  The  inequality  S  >  Tk  means  the  subspace  S  is  better  than  the  subspace 
Tk  in  terms  of  significance  (when  |51  =  ir^l  -  k).  In  this  research,  we  used  the  sum 
of  the  weights  of  k  attributes  in  S  as  the  significance  of  the  subspace  S.  Therefore, 
if  the  weight  sum  of  S  is  greater  than  that  of  Tk,  we  say  S  is  better  than  Tk- 


T     0;  //the  set  of  all  meaningful  subspaces 

Run  2-D  datascan  and  keep  the  histogram; 
for  each  density  S  from  Smm  by  Sine  do 

Find  the  best  subspace  S  of  size  k  for  density  6  from  the  histogram; 

if  [S  =  0)  then  break;  //  done 

if  {S  >  Tk)  then  T/t  ■<—  5;  / /  Tk  is  the  best  subspace  of  k  dimensions 

end 

Choose  an  optimal  T  from  T; 

Figure  3.4.  An  algorithm  for  automatic  discovery  of  all  meaningful  subspaces 

The  DataScan  algorithm  is  performed  on  the  dataset  and  its  histogram  is  kept 
in  the  memory  while  the  algorithm  proceeds.  For  each  density  5,  the  algorithm  finds 
the  best  subspace  S  (whose  size  happens  to  be  k)  from  the  histogram.  The  algorithm 
terminates  if  S  is  empty,  when  there  is  no  meaningful  subspace  with  a  given  density 
6.  This  is  guaranteed  as  the  density  S  increases.  The  subspace  S  is  compared  against 
Tk,  the  current  best  subspace  of  size  k.  S  replaces  T^t  if  5  >  Tk-  When  the  algorithm 
terminates,  T  contains  the  subspaces  of  sizes  from  2  to  d  that  best  represent  the 
dataset,  regardless  of  the  density  parameter. 
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Note  that  T  may  not  contain  all  d  -  1  meaningful  subspaces.  Based  on  the 
characteristics  of  the  dataset,  it  may  contain  only  a  few  significant  subspaces.  Fur- 
thermore, the  sizes  of  the  subspaces  in  T  [k  for  any  Tk  G  T)  may  be  too  big  to  be 
practically  used  (i.e.,  k  ^  d).  In  that  case,  the  best  m  attributes  can  be  selected  from 
an  appropriate  subspace  of  size  k  where  m  <  k.  It  seems  reasonable  to  choose  such 
m  attributes  from  the  smallest  subspace.  However,  more  study  needs  to  be  done  to 
identify  the  best  subspace  to  choose  m  best  attributes  from.  We  address  this  issue 
in  the  remainder  of  this  section. 

Identification  of  an  optimal  subspace  It  is  difficult  to  find  out  which  subspace 
is  optimal  unless  the  subspace  is  applied  to  the  target  mining  operation  and  the  result 
is  compared  to  that  for  which  the  entire  attributes  are  used.  We  define  a  quasi- optimal 
subspace  to  be  a  subspace  that  can  generate  equal  or  better  result  when  the  target 
data  mining  operation  is  applied,  than  that  for  which  the  operation  is  applied  to  its 
superspace.  A  set  of  attributes  A  is  a  superspace  of  B  if  B  C  ^.  By  the  definition, 
there  can  be  more  than  one  quasi-optimal  subspace.  In  spite  of  its  completeness  of 
the  algorithm  given  in  Figure  3.4,  we  still  need  a  method  that  enables  us  to  identify 
an  optimal  subspace  from  T. 

The  simplest  method  would  be  to  arbitrarily  choose  a  reasonable  (or  desired) 
number  of  attributes  based  on  the  given  application.  With  the  dimensionality  d  of  the 
dataset  known,  say  100,  it  would  not  make  sense  to  choose  too  small  (e.g.,  4)  or  too 
large  (e.g.,  90)  number  of  significant  attributes.  Even  if  the  choices  based  on  common 
sense  work  well  most  of  the  time,  however,  we  need  to  quantify  the  goodness  of  the 
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subspaces  in  order  to  choose  the  best  one.  We  used  the  weight  sum  of  constituent 
attributes  to  compare  the  goodness  of  the  subspaces  of  the  same  size  in  the  automatic 
discovery  algorithm.  In  order  to  compare  the  goodness  of  subspaces  of  different  sizes, 
other  metrics  such  as  the  average  weight  may  be  used. 

Alternatively,  we  can  take  advantage  of  the  properties  of  the  target  data  mining 
operation  to  identify  a  quasi-optimal  subspace.  For  classification,  for  instance,  the 
properties  of  a  classifier  such  as  the  accuracy,  execution  time  and  amount  of  memory 
needed  can  be  used  to  measure  the  goodness  of  a  subspace.  One  good  candidate 
property  is  the  accuracy.  In  general,  using  large  datasets  results  in  the  improvement 
in  the  accuracy  of  the  classifier  [38,  34].  However,  accuracy  is  not  a  monotonically 
increasing  function  of  the  number  of  attributes  as  in  the  preliminary  experiments 
shown  in  Figure  3.1.  A  set  of  the  most  significant  attributes  may  result  in  a  more 
accurate  classifier  because  it  may  lack  the  effect  of  overfitting  [20]. 

Suppose  that  we  select  a  subspace  of  size  k  from  the  automatic  discovery  algo- 
rithm. Given  a  number  m  such  that  m  <  k,  we  start  from  building  a  classifier  on 
m  most  significant  attributes  of  the  subspace.  Then  we  build  Tj^-i  on  m  -  1  most 
significant  attributes.  If  the  accuracy  of  is  equal  to,  or  better  than,  that  of  T^-i, 
the  procedure  stops.  Otherwise,  the  procedure  repeats  until  this  condition  is  met  or 
up  to  a  given  number  of  iterations.  The  best  classifier  is  chosen  among  the  ones  built. 

This  method  seems  expensive  since  it  needs  to  build  several  classifiers,  although 
each  of  them  is  built  on  a  small  subspace  with  less  cost.  For  m  <  rf,  the  cost  may  be 
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still  less  than  that  for  the  original  classifier.  Moreover,  the  classifier  may  be  perfect- 
fitted  with  less  number  of  attributes,  resulting  in  a  higher  accuracy,  compensating  for 
the  higher  cost,  as  shown  in  Figure  3.1.  Experimental  results  show  that  the  accuracy 
reaches  a  stable  state  for  some  j  near  m  (Figures  3.5  through  3.10).  The  other  mea- 
sures are  either  a  monotonic  function  of  the  number  of  attributes  or  unpredictable. 
Therefore,  if  m  is  chosen  around  the  stable  state,  a  quasi-optimal  subspace  can  be 
found  quickly  after  only  2  or  3  iterations. 

3.3    Experimental  Results 

In  this  section,  we  present  the  experimental  results  for  which  classification  and 
clustering  operations  are  applied  to  prove  the  effectiveness  of  the  subspace  identifica- 
tion algorithms.  For  classification,  we  used  a  simple  decision  tree  classifier  algorithm 
that  exhaustively  searches  for  the  best  splitting  points  [15,  19].  At  each  node,  the 
algorithm  sorts  all  the  numeric  attributes  of  the  records  to  discover  the  best  splitting 
point.  The  histogram  of  categorical  attributes  are  computed  at  the  root  node  and 
passed  down  the  tree  iteratively.  We  used  the  gini  index  [15]  as  the  split  selection 
method:  the  attribute  with  the  lowest  gini  index  is  chosen  to  split  the  records  at 
each  node.  For  clustering,  we  use  synthetic  datasets  [54]  to  show  that  the  subspaces 
used  to  define  clusters  can  be  captured  by  the  subspace  identification  algorithms. 
We  will  call  the  case  of  subspace  identification  applied  to  classification  as  subspace 
classification. 

The  quality  metrics  considered  for  classification  are  accuracy,  size  of  the  pruned 
classifier,  execution  time,  and  amount  of  memory  used  [49].  Each  of  these  metrics 
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from  the  experimental  results  is  normalized  to  its  counterpart  of  the  classifier  upon 
which  the  entire  set  of  attributes  is  applied  (the  original  classifier).  The  normalized 
results  will  reveal  the  extent  of  improvement  that  the  subspace  identification  algo- 
rithms achieve  over  the  original  classifier,  and  make  the  results  independent  of  the 
machine  in  which  the  experiments  are  performed.  Table  3.1  shows  the  description  of 
the  datasets  used  in  the  experiments.  The  real  datasets  used  are  acquired  from  the 
UCI  Repository  [1].  The  numbers  in  parentheses  in  the  Attributes  column  indicates 
the  number  of  numeric  attributes  and  the  number  of  categorical  attributes,  respec- 
tively. Although  the  datasets  used  are  not  very  high-dimensional,  especially  for  the 
real  datasets,  we  found  them  sufficient  to  show  the  effectiveness  of  the  subspace  iden- 
tification algorithms.  The  fc-fold  cross  validation  was  used  on  the  datasets  without 
test  set,  which  is  denoted  as  CV-A;  on  the  top  of  the  figures. 


Table  3.1.  The  description  of  the  datasets  used  in  the  experiments 


Dataset 

Attributes 

Classes 

Train 

Test 

Clusters 

Noise 

letter 

16  (16/0) 

26 

20,000 

a  US-credit 

14  (6/8) 

2 

690 

optic 

64  (64/0) 

10 

3,823 

1,797 

splice 

60  (0/60) 

3 

3,190 

spambase 

57  (57/0) 

2 

4,600 

Synthetic 

50-200  (*/0) 

3 

10^  -  10^ 

800-8,000 

15 

10% 

Based  on  our  experience,  we  have  found  that  it  is  somewhat  more  convenient 
to  use  the  DataScan  algorithm  than  the  SubAssoc  algorithm  to  discover  the  signifi- 
cant attributes.  The  reason  is  that  it  provides  almost  identical  results  as  those  by  the 
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SubAssoc  while  allowing  us  to  quickly  choose  the  best  subspace.  Since  there  is  no  fun- 
damental difference  between  the  SubAssoc  and  the  DataScan  algorithms,  the  results 
of  the  DataScan  algorithm  will  be  given  in  this  section  unless  explicitly  mentioned 
otherwise. 

Synthetic  data  generation  We  use  a  synthetic  data  generator  [54]  to  produce 
datasets  with  clusters  of  high  density  in  specific  subspaces.  The  data  generator  allows 
control  over  the  structure  and  the  size  of  datasets  through  parameters  such  as  the 
number  of  records,  the  number  of  attributes,  the  number  of  classes,  and  the  range 
of  values  for  each  attribute.  It  also  produces  test  datasets  for  classification  from 
the  specified  clusters.  In  the  experiments,  the  range  of  values  was  set  to  [0,100]  for 
all  attributes.  The  description  of  the  generator  and  the  specifications  used  for  data 
generation  can  be  found  in  the  technical  report  for  the  generator  [54]. 

A  synthetic  dataset  is  described  in  terms  of  the  parameters  used  to  generate  the 
dataset.  The  parameters  are  given  in  the  following.  For  instance,  K20-A16-C26-I16 
given  in  Figure  3.5  represents  the  dataset  with  20,000  records  (K),  16  attributes  (A), 
26  classes  (C),  and  16  intervals  (I).  The  other  upper  case  letters  used  are:  S  for  the 
number  of  significant  attributes  used  to  specify  the  clusters,  U  for  the  number  of 
significant  attributes  used  in  subspace  classification,  P  for  the  number  of  clusters, 
and  N  for  the  ratio  of  noise  records  to  the  total  number  of  records.  The  ratio  of  noise 
is  set  to  10%  for  all  synthetic  datasets  used  in  the  experiments. 
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Naive  approach  For  subspace  classification,  a  naive  approach  would  be  to  run 
a  classifier  algorithm  on  a  given  dataset  until  m  distinct  attributes  are  found  as  the 
splitting  attributes.  Such  restricted  pre- classification  (denoted  PreClass)  is  presum- 
ably capable  of  providing  a  good  set  of  significant  attributes  for  classification.  The 
reason  is  that  the  splitting  attributes  used  near  the  top  of  the  decision  tree  will  affect 
the  accuracy  of  the  classifier  the  most.  However,  it  may  cost  as  much  as  building  the 
classifier  upon  all  the  attributes,  depending  on  the  choice  of  m.  It  is  also  difficult 
to  choose  the  appropriate  value  of  m.  The  experimental  results  of  PreClass  are  also 
given  in  this  section  for  comparison  purpose. 

3.3.1    Effectiveness  of  Subspace  Identification 

Figures  3.5  to  3.10  show  the  normalized  results  of  classification  on  meaningful 
subspaces  against  the  original  classification  upon  the  entire  attributes.  The  dotted 
straight  line  on  the  top  of  each  figure  is  for  the  result  of  the  original  classifier  (normal- 
ized to  100%),  against  which  the  other  plots  are  normalized.  The  automatic  discovery 
algorithm  given  in  Section  3.2.3  was  used  to  identify  a  quasi-optimal  subspace.  For 
letter,  optic,  and  splice  datasets,  subspaces  of  size  14  at  0.6%  support,  of  size  43  at  8% 
support,  and  of  size  11  at  11%  support  were  chosen,  respectively.  On  the  other  hand, 
for  aus-credit,  spambase,  and  the  synthetic  datasets,  the  smallest  subspace  found  was 
of  size  12  at  20%  support,  of  size  17  at  40%  support,  and  of  size  23  at  1.1%  support, 
respectively.  We  selected  the  best  8  attributes  for  aus-credit,  the  best  12  for  spambase, 
and  the  best  10  for  the  synthetic  dataset. 


Figure  3.5.  The  subspace  classification  applied  to  letter  dataset 


Figure  3.6.  The  subspace  classification  applied  to  aus-credit  dataset 
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The  figures  show  the  quality  metrics  of  the  decision  tree  classifiers  built  on  the 
chosen  subspaces,  with  the  size  of  the  subspaces  used  starting  from  the  chosen  number 
(the  rightmost  columns)  and  decreasing  by  one  (moving  to  the  left) .  For  comparison 
purpose,  we  provide  the  plots  denoted  RandomS  for  the  normalized  accuracies  of  the 
classifiers  built  on  the  corresponding  number  of  attributes  that  are  randomly  selected. 
The  accuracies  are  averages  of  5  executions.  Each  chosen  subspace  turns  out  to  be 
a  quasi-optimal  subspace  except  for  the  aus-credit  and  the  synthetic  datasets  whose 
quasi-optimal  subspaces  are  even  smaller. 

The  amount  of  memory  and  the  execution  time  are  apparently  less  than  those 
required  by  the  original  classifier.  For  all  datasets  except  aus-credit  and  spambase, 
the  chosen  subspaces  result  in  a  classifier  of  equal  or  better  quality.  Interestingly,  for 
aus-credit  dataset,  only  3  out  of  14  attributes  seem  good  enough  to  build  a  classifier 
of  better  quality.  For  other  datasets,  we  can  also  obtain  classifiers  of  better  accu- 
racy with  significantly  less  number  of  attributes.  This  presents  another  example  of 
overfitting:  the  classifier  built  upon  all  the  attributes  may  be  overfitted. 

The  optic  dataset  is  to  recognize  the  features  of  handwritten  letters.  The  size 
of  the  significant  subspace  (43)  chosen  seems  large  compared  to  the  entire  attributes 
(64).  What  the  results  imply,  however,  is  that  it  enables  us  to  know  what  are  the 
more  important  features  for  a  given  task.  This  may  help  in  refining  data  collection 
process  and  understanding  the  task  more  soundly.  It  also  holds  true  for  the  splice 
gene  sequence  dataset.  The  selected  small  number  of  gene  positions  may  direct  the 
researchers  as  to  where  to  concentrate  on. 


Subspace  classification  on  opt  dataset:  K3.823-A64-C10-I16,  support=8% 
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Figure  3.7.  The  subspace  classification  applied  to  optic  dataset 


Subspace  classification  on  splice  dataset:  K3.190-A60-C3-I8,  support=1 1%,  CV-5 
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Figure  3.8.  The  subspace  classification  applied  to  splice  dataset 
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Subspace  classification  on  spambase:  K4.6-A57-C2-I10.  support  40%,  CV-5 
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Figure  3.9.  The  subspace  classification  applied  to  spambase  dataset 
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Synttietlc  dataset:  K20-A50-C3-S9-P15-I10-N10.  support  1.1% 
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Figure  3.10.  The  subspace  classification  applied  to  a  synthetic  dataset 
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The  purpose  of  the  spambase  dataset  is  to  filter  out  incoming  spam  mails  [1]. 
Each  record  contains  various  features  of  an  e-mail  such  as  the  frequencies  of  specific 
words  and  their  lengths.  In  general,  when  one  tries  to  collect  a  dataset  for  this  kind 
of  purpose,  it  is  difficult  to  determine  what  to  look  at  in  the  beginning.  A  good 
example  is  the  World  Wide  Web,  where  data  are  in  many  different  formats  and  high- 
dimensional  in  nature.  The  subspace  identification  algorithms  can  be  used  to  discover 
the  more  important  features  of  the  data  in  the  applications  of  this  category.  As  in 
the  case  of  the  spambase  dataset,  it  turns  out  that  only  a  few  attributes  (say,  12  out 
of  57)  need  to  be  examined  for  the  incoming  e-mails  in  order  to  determine  if  they  are 
spam.  Using  the  12  most  significant  attributes,  we  could  build  a  classifier  with  89.7% 
accuracy  whereas  applying  all  57  attributes  resulted  in  one  with  91.6%  accuracy. 

3.3.2    Comparison  of  the  Algorithms 

Tables  3.2  to  3.4  compare  the  two  subspace  identification  algorithms  on  the  let- 
ter, splice,  and  the  synthetic  datasets,  respectively,  together  with  the  naive  approach 
(PreClass  method),  applied  to  classification.  Tree  sizes  are  shown  only  in  Table  3.2 
since  the  others  presented  in  Tables  3.3  and  3.4  are  similar.  The  headline  rows  con- 
tains the  numbers  of  attributes  that  are  used  to  construct  the  corresponding  classifier. 
For  each  dataset,  the  SubAssoc  algorithm  is  restricted  to  stop  when  it  finds  all  large 
2-itemsets.  Note  that  the  significant  attributes  found  by  the  DataScan  algorithm  may 
be  different  from  those  by  the  SubAssoc  since  the  class  labels  were  not  considered  in 
counting  in  the  DataScan  algorithm.  It  is  interesting  to  note  that  the  PreClass  does 
not  always  result  in  the  classifiers  with  the  best  quality  among  the  three  methods. 
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Note  that  the  halting  point  (large  2-itemsets)  of  the  association  rules  algorithm  is 
the  one  that  ends  up  with  the  best  results  among  all  values  of  k  found.  This  justifies 
the  reasoning  behind  the  DataScan  algorithm. 


Table  3.2.  The  comparisons  on  letter  (K20-A16-C26-I16,  support=0.6%) 


Q  metric 

Method 

7 

8 

9 

10 

11 

12 

13 

14 

Accuracv 

(%) 

PreClass 

87.6 

94.2 

95.2 

98.6 

100.2 

100.2 

100.0 

99.9 

DataScan 

84.5 

92.2 

95.2 

97.5 

100.2 

100.0 

100.0 

100.0 

SubAssoc 

84.5 

89.4 

95.2 

95.2 

97.5 

100.2 

100.0 

100.0 

Tree  Size 
(%) 

PreClass 

99.6 

99.0 

99.5 

98.7 

96.7 

97.2 

99.1 

99.3 

DataScan 

92.1 

98.1 

99.5 

98.3 

96.7 

98.7 

99.0 

99.4 

SubAssoc 

92.1 

100.6 

99.5 

101.1 

99.2 

96.8 

98.8 

99.4 

Table  3.3.  The  comparisons  on  splice  (K3.19-A60-C3-I8,  support=ll%) 


Q  metric 

Method 

5 

6 

7 

8 

9 

10 

11 

Accuracv 
(%) 

PreClass 

87.6 

96.6 

99.6 

99.5 

99.6 

99.4 

100.0 

DataScan 

96.5 

97.5 

98.6 

100.1 

100.1 

100.1 

100.1 

SubAssoc 

96.5 

97.5 

98.6 

100.1 

100.1 

100.1 

100.1 

Table  3.4.  The  comparisons  on  synthetic  data  (K20-A50-C3-I10,  support=l.l%) 


Q  metric 

Method 

4 

5 

6 

7 

8 

9 

10 

Accuracy 

(%) 

PreClass 

99.5 

98.5 

100.8 

100.8 

100.6 

100.6 

100.2 

DataScan 

94.4 

101.4 

100.8 

100.8 

100.7 

100.7 

100.5 

SubAssoc 

99.5 

101.4 

101.4 

101.3 

101.0 

100.8 

100.8 

In  Tables  3.3  and  3.4,  the  accuracy  sometimes  exceeds  that  of  the  original  clas- 
sifier. Again,  this  is  believed  to  be  due  to  the  overfitting  caused  by  the  excessive 
number  of  attributes  in  the  dataset.  Note  that  only  9  out  of  50  attributes  were  used 
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to  specify  the  clusters  when  the  dataset  was  created.  Interestingly,  using  only  5  at- 
tributes achieves  the  best  accuracy  of  all.  The  reason  seems  that,  of  the  9  attributes 
used,  some  of  them  are  involved  with  many  clusters  while  the  others  are  with  only 
one  or  two. 


Table  3.5.  Comparison  of  execution  times  for  subspace  identification  algorithms 


Dataset 

Support  (%) 

|Subspace| 

AIIAttrs 

PreClass 

DataScan 

letter 

0.6 

11 

12.87 

11.58  (4.28) 

9.10  (1.80) 

spambase 

40.0 

9 

7.78 

4.10  (2.72) 

4.08  (2.70) 

aus-credit 

20.0 

3 

0.59 

0.14  (0.09) 

0.35  (0.14) 

splice 

11.0 

11 

3.63 

1.77  (0.52) 

3.88  (2.60) 

optic 

8.0 

43 

5.45 

9.22  (5.42) 

7.98  (3.80) 

Synthetic 

1.1 

5 

33.95 

16.91  (11.83) 

14.88  (9.80) 

Table  3.5  compares  the  execution  times  of  three  subspace  identification  methods. 
The  numbers  in  the  table  are  the  overall  execution  time  in  seconds,  which  is  the 
sum  of  the  time  taken  for  identification  of  the  best  subspace  and  the  time  taken 
to  build  a  classifier  on  the  subspace.  The  time  taken  for  subspace  identification 
alone  is  put  in  parentheses  for  PreClass  and  DataScan.  It  shows  that  the  cost  for 
subspace  identification  ranges  between  mere  20%  and  66%  of  the  total  execution 
time,  except  for  optic  and  splice  datasets  for  which  the  other  benefits  of  subspace 
identification  compensate  the  excessive  execution  time.  The  overall  statistics  reveal 
that  the  DataScan  algorithm  is  even  better  than  the  naive  method  (PreClass). 

3.3.3  Scalability 

This  section  provides  the  results  for  scalability  experiments.  The  scalability 
of  the  subspace  identification  algorithms  were  tested  in  two  terms:  the  number  of 
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attributes  and  the  number  of  records.  In  testing  the  scalability  in  terms  of  the 
number  of  attributes,  we  assume  that  the  number  of  significant  attributes  remains 
unchanged.  Figure  3.11  shows  the  normalized  results  for  the  scalability  test  against 
the  number  of  attributes,  where  only  5  significant  attributes  are  used  for  subspace 
classification  in  each  case.  The  number  of  attributes  ranges  from  50  to  200  with  an 
interval  of  25.  The  accuracy  stays  close  to  100%  of  the  original,  and  the  execution 
time  and  the  amount  of  memory  needed  remain  almost  constant.  The  tree  size  is  not 
a  function  of  the  number  of  attributes,  so  it  is  unpredictable  as  expected. 


Scalability  in  terms  ot  number  of  attributes  (dataset  K20-A*-C3-S9-U5-P15-I10-N10) 
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Figure  3.11.  Dimensionality  scalability  of  subspace  classification  on  synthetic  data 


Figure  3.12  presents  the  normalized  results  for  the  scalability  test  against  the 
number  of  records,  where  only  5  significant  attributes  out  of  50  are  used  for  subspace 
classification  in  each  case.  The  number  of  records  varies  from  10,000  to  100,000.  The 
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plots  for  the  quality  metrics  shown  in  the  figure  remain  constant  at  a  satisfactory 
level,  which  proves  the  technique  is  scalable  in  terms  of  the  number  of  records. 


Scalability  in  terms  of  number  ot  records  (dataset  K--A50-C3-S9-U5-P15-n0-N10) 
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Figure  3.12.  Size  scalability  of  subspace  classification  on  synthetic  dataset 


3.3.4    Experimental  Results  for  Clustering 

The  experiments  for  clustering  are  focused  on  the  effectiveness  of  the  subspace 
identification  algorithms  presented  in  this  research.  Since  it  is  easy  to  know  the 
parameters  of  the  datasets,  the  synthetic  data  generator  described  in  the  beginning 
of  this  section  was  used  to  generate  the  test  datasets.  We  used  two  datasets,  each 
of  which  was  generated  using  a  similar  configuration.  The  cluster  descriptions  are 
identical  for  both  datasets,  while  the  only  difference  of  the  datasets  is  their  sizes. 
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The  parameters  of  the  datasets  are  K20-A50-C3-S9-P15-I10-N10  for  dataset  1  and 
K100-A50-C3-S9-P15-I10-N10  for  dataset  2. 

Table  3.6  contains  the  description  of  the  datasets  used  in  the  experiments.  The 
number  within  the  parentheses  after  each  dimension  represents  the  width  of  the 
interval  within  which  the  corresponding  cluster  was  generated  with  regard  to  the  di- 
mension being  described.  Since  the  range  of  values  is  set  to  [0,100]  for  all  dimensions, 
the  width  indicates  the  density  of  the  points  in  the  cluster  along  the  dimension.  The 
columns  Pointsl  and  Points2  show  the  number  of  points  in  the  corresponding  cluster 
for  dataset  1  and  2,  respectively. 

Table  3.6.  The  description  of  the  datasets  used  in  the  experiments  for  clustering 


Cluster 

Dimensions 

Pointsl 

Points2 

0 

6(20),  10(20),  23(11) 

1,400 

7,000 

1 

6(20),  10(20),  23(10) 

1,400 

7,000 

2 

6(30),  10(20),  23(70) 

1,200 

6,000 

3 

10(15),  17(20) 

1200 

6,000 

4 

10(31),  17(10) 

1200 

6,000 

5 

10(45),  17(30),  22(51) 

1,600 

8,000 

6 

6(10),  23(20) 

1200 

6,000 

7 

6(10),  23(10) 

1400 

7,000 

8 

6(40),  23(30) 

1400 

7,000 

9 

6(15),  17(30),  40(41) 

1,600 

8,000 

10 

6(20),  17(20),  40(15) 

1,200 

6,000 

11 

6(30),  17(30),  40(10) 

1,200 

6,000 

12 

5(51),  20(51),  30(51) 

600 

3,000 

13 

5(51),  20(50),  30(41) 

600 

3,000 

14 

5(51),  20(51) 

800 

4,000 

Table  3.7  shows  the  results  of  subspace  identification  for  the  two  datasets  for 
0.5%  support.  The  results  indicate  that  the  subspace  identification  presented  in 
this  research  can  capture  the  significant  attributes  in  the  order  of  frequencies  they 
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appear  in  the  definition  of  clusters.  In  particular,  for  dataset  1,  all  the  significant 
dimensions  used  to  define  the  clusters  are  captured  except  for  dimension  20.  It  is 
also  true  for  dataset  2  except  for  dimension  30.  In  Table  3.7,  the  dimensions  20  and 
30  are  involved  in  defining  three  clusters  (12,  13,  and  14).  Note  that  the  intervals  of 
the  defining  dimensions  lie  within  more  than  half  of  the  entire  interval  (100).  The 
density  of  points  in  the  clusters  defined  by  these  attributes  seem  to  be  much  lower 
than  the  density  in  the  other  clusters.  This  observation  means  that  the  dimensions 
20  and  30  are  most  likely  not  sufficiently  significant  to  form  some  correlation  with 
other  dimensions,  even  though  they  are  used  to  define  three  clusters. 


Table  3.7.  The  identified  subspaces  of  the  synthetic  datasets  for  clustering 


Dataset 

Significant  dimensions 

Decreasing  order  of  importance 

1 

6,10,23,17,22,40,5,20,30 

6, 10,23, 17,40,8,26, 30,-32,5,22,. . . 

2 

6,10,23,17,22,40,5,20,30 

6,23,10,17,40,22,5,20,47,29,. . . 

Note  that  the  results  of  clustering  applied  to  the  subspaces  discovered  by  the 
subspace  identification  algorithms  cannot  be  directly  compared  to  those  by  the  ex- 
isting clustering  algorithms  such  as  DBSCAN  [7],  PROCLUS  [32],  or  BIRCH  [55]. 
Instead,  the  subspace  identification  can  supplement  these  algorithms  in  order  to  ex- 
pedite the  overall  process. 

3.4  Summary 

The  algorithms  proposed  in  this  chapter  to  identify  significant  subspaces  of  high- 
dimensional,  sparse  data  are  motivated  by  an  observation  that  not  all  dimensions  need 
to  be  examined  to  discover  hidden  knowledge  contained  in  the  data.  We  also  notice 
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that  most  of  the  dimensions  are  sometimes  seldom  picked  up  as  deciding  ones  such 
as  the  spHtting  attributes  in  classification,  yet  heavily  involved,  during  the  process  of 
some  data  mining  operations.  The  efficiency  of  data  mining  process  would  be  greatly 
enhanced  if  we  could  identify  those  significant  dimensions  that  affect  the  quality  of 
results  the  most. 

We  present  two  algorithms  that  identify  significant  subspaces  of  high-dimensional, 
sparse  data:  the  SubAssoc  algorithm  and  the  DataScan  algorithm.  The  SubAssoc 
algorithm  combines  a  subspace  clustering  technique  [42]  and  association  rules  min- 
ing [5,  52].  The  significant  attributes  are  the  ones  that  constitute  a  subspace  in  which 
contains  more  number  of  records  (thus  dense)  than  others.  On  the  other  hand,  the 
DataScan  algorithm  provides  a  sensible  alternative  to  the  SubAssoc  algorithm.  It 
enables  users  to  iteratively  choose  an  optimal  subspace  that  could  result  in  the  best 
cost-performance  ratio.  The  identified  subspaces  can  further  be  used  to  reduce  the 
dimensionality  of  the  datasets  without  loss  of  quality. 

The  experimental  results  show  that  the  algorithms  are  effective  as  well  as  effi- 
cient in  identifying  the  significant  attributes  for  two  popular  data  mining  operations: 
classification  and  clustering.  The  experiments  prove  that  the  optimal  subspaces  cho- 
sen by  the  proposed  algorithms  produce  similar,  sometimes  even  better,  results  at  less 
cost.  The  algorithms  are  also  scalable  in  terms  of  both  the  number  of  attributes  and 
the  number  of  records.  An  open  problem  is  to  find  the  optimal  values  of  the  number 
of  intervals  and  the  support  count,  or  to  remove  the  need  of  these  values.  Another 
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open  problem  is  to  develop  a  more  efficient  method  that  discovers  the  ciuasi-optimal 
subspaces. 

In  summary,  the  proposed  subspace  identification  technique  provides  following 
benefits: 

•  Effectiveness/efficiency:  It  provides  an  eflfective  method  that  enables  us  to 
deal  with  high-dimensional,  sparse  data  at  a  lower  cost. 

•  Comprehensibility:  The  small  subset  of  the  most  meaningful  attributes  dis- 
covered can  be  used  to  generate  a  more  compact,  thus  more  comprehensive, 
representation  of  data.  It  also  helps  remove  unnecessary  attributes  from  the 
data,  allowing  reorganization  of  the  data  and  more  efficient  data  collection 
process. 

•  No  requirement  for  a  priori  knowledge:  It  does  not  require  any  a  priori 
knowledge  of  the  data  to  identify  the  significant  attributes. 

•  Scalability:  The  classification  technique  scales  well  with  the  number  of  dimen- 
sions as  well  as  the  number  of  records  (i.e.,  the  size  of  the  data). 

•  Alleviation  of  overfitting:  This  technique  can  be  applied  to  low-dimensional 
data  as  well  to  remove  or  alleviate  the  effect  of  overfitting  of  the  classifier  [20]. 


CHAPTER  4 

INCREMENTAL  CLASSIFICATION  ON  LARGE  DYNAMIC  DATA 


Classification  is  an  important,  well-known  problem  in  the  field  of  data  mining, 
and  has  remained  an  extensive  research  topic  within  several  research  communities. 
Over  the  years,  classification  has  been  successfully  applied  to  diverse  areas  such  as 
retail  target  marketing,  medical  diagnosis,  weather  prediction,  credit  approval,  cus- 
tomer segmentation,  and  fraud  detection  [16].  The  classification  models  [14]  that 
have  been  proposed  in  the  literature  include  Bayesian  classification  [29],  neural  net- 
works [56],  statistical  models  such  as  linear/quadratic  discriminants  [57],  genetic 
models  [58],  and  decision  trees  [15,  18,  19]. 

Thanks  to  the  advances  in  data  collection/storage  technologies,  and  large  scale 
applications,  the  datasets  for  data  mining  applications  have  become  large  and  may 
involve  several  millions  of  records.  Each  record  typically  consists  of  ten  to  hundreds 
attributes,  some  of  them  with  large  number  of  distinct  values.  In  general,  using  large 
datasets  results  in  the  improvement  in  the  accuracy  of  the  classifier  [38,  34],  but  the 
enormity  and  complexity  of  the  data  involved  in  these  applications  make  the  task  of 
classification  computationally  very  expensive.  Since  datasets  are  large,  they  cannot 
reside  completely  in-memory,  which  makes  I/O  a  significant  bottleneck.  Performing 
classification  for  such  large  datasets  requires  development  of  new  techniques  that 
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limit  accesses  to  the  secondary  storage  in  order  to  minimize  the  overall  execution 
time. 

Another  problem  that  makes  classification  more  difficult  is  that  most  datasets 
are  growing  over  time,  continuously  making  the  previously  constructed  classifier  obso- 
lete. Addition  of  new  records  (or  deletion  of  old  records)  may  potentially  invalidate 
the  existing  classifier  that  has  been  built  on  the  old  dataset  (which  is  still  valid). 
However,  constructing  a  classifier  for  large  growing  dataset  from  scratch  would  be 
extremely  wasteful  and  computationally  prohibitive. 

In  this  research,  we  present  a  method  that  incrementally  classifies  a  large  database 
of  records  which  grows  in  size  over  time,  and  that  still  results  in  a  decision  tree  with 
comparable  quality.  The  method,  named  ICE  (/ncremental  Classification  for  E'ver- 
growing  large  datasets),  incrementally  builds  decision  tree  classifiers  for  large  growing 
datasets  using  tree-based  sampling  techniques.  The  basic  idea  is  to  represent  the  class 
distribution  in  the  dataset  by  weighted  samples.  The  weighted  samples  are  extracted 
from  the  nodes  of  intermediate  decision  trees  using  various  sampling  techniques.  As 
the  dataset  grows,  an  intermediate  classifier  is  built  only  on  the  incremental  portion  of 
the  data.  The  weighted  samples  from  the  intermediate  classifier  are  combined  with 
the  previously  generated  samples  to  obtain  an  up-to-date  classifier  for  the  current 
data  in  an  efficient  manner. 

ICE  is  capable  of  handling  large,  growing  data  and  developing  rapid  prototypes 
of  classification  models  since  it  deals  only  with  the  current  incremental  portion  of 
the  data  and  samples  passed  from  the  past.  This  automatically  solves  the  problem 
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of  high  computational  complexity,  and  results  in  limited  access  to  secondary  storage. 
The  World  Wide  Web  is  one  of  the  areas  that  grow  very  quickly  and  thus  generate 
huge  amount  of  data  every  day.  The  Web  also  changes  so  dramatically  that  the 
classified  objects  today  may  not  have  the  same  classification  any  more  after  a  few 
days.  ICE  can  be  successfully  applied  to  such  areas  where  the  data  grow  very  quickly 
and  rapid  prototyping  is  needed. 

4.1    Incremental  Classification 

Problem  definition  A  decision  tree  T  has  been  built  for  a  dataset  D  of  size 
iV  by  a  decision  tree  classification  algorithm.  After  a  certain  period  of  time,  a  new 
incremental  dataset  d  of  size  n  is  collected,  and  a  new  decision  tree  T'  needs  to  be 
built  for  the  combined  dataset  D  +  d.  This  problem  is  nontrivial  because  addition  of 
new  records  may  change  some  splitting  points  and  thus  consequently  the  new  decision 
tree  T'  may  be  quite  different  from  T.  The  goal  is  to  build  a  decision  tree  classifier 
T'  with  minimal  cost  such  that  T'  is  comparably  as  accurate  as  the  decision  tree  that 
would  be  built  for  D  +  d  from  scratch. 

Reusing  previous  computation  It  is  attractive  to  utilize  some  of  previous  com- 
putation that  was  performed  to  build  the  tree  T.  For  instance,  with  an  algorithm 
like  SPRINT,  the  values  of  the  splitting  numerical  attribute  at  each  leaf  node  of  the 
tree  are  sorted.  The  difficulty  is  that  only  the  splitting  numerical  attribute  is  sorted 
at  each  leaf  node  whereas  a  different  splitting  attribute  may  have  been  found  to  each 
leaf. 
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We  can  also  modify  the  decision  tree  algorithm  so  that  some  useful  information 
such  as  class  distribution  is  stored  at  each  node.  However,  class  distribution  alone 
does  not  help  much  to  discover  the  new  splitting  point  because  of  the  instability  of 
impurity-based  split  selection  methods  to  be  described  next.  For  approximation,  class 
distribution  may  be  useful  to  compute  the  splitting  point  for  categorical  attributes, 
whereas  it  is  not  so  helpful  for  numerical  attributes. 

Impurity-based  split  selection  methods  In  this  work,  we  consider  split  selec- 
tion methods  that  produce  binary  split,  and  focus  on  impurity-based  split  selection 
methods  for  two  reasons.  First,  this  class  of  methods  is  the  most  popular  [15,  46],  and 
it  has  been  shown  that  they  produce  decision  trees  with  high  predictive  accuracy  [14]. 
Second,  most  previous  work  in  the  database  area  uses  this  class  [19,  59].  Therefore, 
using  this  class  of  method  will  make  comparisons  easy  and  fair. 

We  use  the  gini  index  [15]  to  illustrate  impurity-based  split  selection  methods. 
Let  n  and  Ci  be  the  total  number  of  records  in  a  set  S  and  the  number  of  records 
that  belong  to  class  i,  respectively.  Let  m  be  the  number  of  classes.  Then,  the  gini 
index  is  defined  as  follows: 

m  f~i 

gim{S)  =  1  -  Y.{-f  (4.1) 

When  S  is  split  into  two  subsets,  and  52,  with  nj  and  1x2  records,  respectively, 
the  corresponding  gini  index  of  the  set  S  is  calculated  in  terms  of  the  gini  index  of 
the  subsets  as  follows: 
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ginispiitiS)  =  —gim{Si)  H  gini{S2) 


(4.2) 


The  gini  index  defined  above  does  not  lend  itself  for  reuse  in  accordance  with 
the  changes  in  the  number  of  records  (and  hence  the  class  distribution).  Impurity- 
based  split  methods  such  as  gini  index  cause  such  instability  [40],  which  makes  the 
induction  of  V  from  T  difficult. 

4.1.1    Weighted  gini  Index 

A  weight  can  be  assigned  to  a  record  when  the  record  represents  a  number 
of  other  records.  Since,  however,  the  gini  index  presented  above  works  only  on  the 
records  with  equal  weight,  it  is  inappropriate  to  handle  such  weighted  records.  There- 
fore, the  gini  index  must  also  change  accordingly  as  each  record  carries  different 
weight.  In  this  section,  we  introduce  the  weighted  gini  index  which  provides  a  math- 
ematical background  for  incremental  classification.  The  motivation  for  the  weighted 
gini  index  is  to  replace  similar  records  by  a  weighted  representative  and  use  them  in 
an  incremental  batch  learning. 

Suppose  that  there  exists  a  set  S'  of  n  sorted  numerical  points  in  nondecreasing 
order,  with  each  point  (call  it  w-point)  associated  with  a  weight  k  for  any  positive 
integer  k  >  \.  Let  Wj  be  the  weight  of  j-th  w-point  in  S' .  Then,  the  total  weight 
sum  W  of  S'  is: 


n 


(4.3) 
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The  set  S'  is  now  divided  into  two  subsets,  S[  and  S2,  with  rii  and  n2  t/;-points, 
respectively,  such  that  n  =  Ui  +  n2  and  all  the  iw-points  in  S[  are  less  than  any 
one  in  Now  the  question  is  how  we  take  the  weights  into  account  in  calculating 
the  impurity  of  a  set.  Recall  that  the  class  distribution  Ci  in  Equation  4.1  is  the 
probability  for  a  point  in  S  to  be  classified  into  class  i.  Let  Qi  be  the  weight  sum  of 
the  ■w-points  that  belong  to  class  i  in  S'.  Then,  Equation  4.1  is  to  be  re-written  as 
following: 

gim-{S')  =  1  -  f:(|f  (4.4) 

We  now  consider  the  distribution  of  iw-points  between  S[  and  52  in  computing 
the  gini  index  in  Equation  4.2.  In  Equation  4.2,  the  terms  ^  and  ^  represent  the 
portions  of  the  subsets  51  and  52,  respectively,  in  the  set  5.  In  other  words,  these 
simple  fractions  mean  that  the  impurity  of  a  set  tends  to  increase  as  the  number  of 
records  contained  in  the  set  increases.  When  the  points  are  associated  with  weights, 
this  statement  illustrates  that  the  impurity  of  a  set  tends  to  increase  as  the  weight 
sum  of  points  contained  in  the  set  increases. 

Let  1^1  and  W2  be  the  weight  sums  of  subsets  S[  and  52,  respectively.  Then, 
we  can  modify  the  Equation  4.2  based  on  Equations  4.3  and  4.4  as  follows: 

9^n^lut{S')  =  ^^gim^{S[)  +  ^5zm-(5^)  (4.5) 

Now  let  us  check  if  gini'^pi-^  is  consistent  with  the  original  ginisput  index,  i.e., 
9i™^piit  =  gi^i split  at  a  common  splitting  point  x.  Suppose  that  w-points  are  chosen 


57 


from  S  such  that  for  any  p  represented  by  s^.i  and  q  represented  by  s,  for  two  adjacent 
lo-points  and  si,  p  <  q  holds  as  depicted  in  Figure  4.1.  Then  there  always  exists 
a  point  X  such  that  p  <  x  <  q  and  x  does  not  cut  into  any  tw-point^ 
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Figure  4.1.  The  set  S'  of  w-points  along  dimension  Z 

Lemma  1  The  gini^^m  index  value  at  a  clear-cut  splitting  point  x  on  a  dimension  is 
consistent  with  the  original  gini  index  value  ginisput  at  the  same  splitting  point. 

Proof  Suppose  a  point  x  that  clear-cuts  the  set  5'  of  size  n  into  two  subsets  5",  and 
with  ni  and  n2  u;-points,  respectively,  where  n  —  ui  +  n2.  Since  there  is  no  vu- 
point  in  which  contains  x,  the  number  of  points  before  (after)  x  remains  to  be  Ui  (712). 
This  yields  that  W^i  =  ni  and  W2  =  n2,  thus  making  W  =  Wi  +  W2  =  Ui  +  n2  =  n. 
Therefore,  with  respect  to  x,  gini'"{S')  =  gini{S)  implying  that  the  class  distribution 
Qi  remains  unchanged  for  subsets  S[  and  S2,  and  thus  gini^{S')spiit  =  ginispiit{S)  at 
any  such  point  x.  □ 

Now,  we  extend  this  property  of  ti;-point  to  a  multi-dimensional  space.  Note 
that  the  decision  tree  is  constructed  such  that  the  splitting  condition  at  each  node 
clearly  divides  the  data  partition  of  the  node  by  the  splitting  value  along  the  dimen- 
sion used  in  the  condition.  Each  divided  subpartition  is  recursively  used  at  the  next 
4n  other  words,  x  is  a  clear-cut  point  on  the  dimension  Z. 
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level.  The  entire  data  can  be  projected  on  to  a  multi-dimensional  hyperspace,  and 
it  can  be  said  that  each  node  represents  a  hyperbox  in  a  multi-dimensional  space 
composed  of  the  attributes  that  participate  in  the  splitting  conditions  along  the  root 
to  the  node. 

Therefore,  assuming  that  a  single  tw-point  represents  all  the  records  of  a  node, 
we  come  to  the  following  lemma. 

Lemma  2  The  gini^pm  index  value  is  consistent  with  the  original  gini  index  value 
ginigpiit  with  respect  to  the  splitting  conditions. 

Proof  It  directly  follows  from  Lemma  1  and  from  the  fact  that  the  splitting  conditions 
along  the  path  from  the  root  to  any  child  node  clearly  divide  each  dimension  of  the 
hyperspace.  □ 

When  more  than  one  w-point  are  chosen  to  represent  a  node.  Lemma  2  may 
not  hold.  Note,  however,  that  the  weighted  gini  index  will  always  be  computed 
between  two  adjacent  ry-points  when  constructing  a  decision  tree  on  a  set  of  w-points. 
If  each  lo-point  is  chosen  from  around  the  center  of  a  set  of  original  (unweighted) 
records  in  each  node,  gini'^pl^^  would  not  be  very  different  from  gini  spin-  Therefore, 
weighted  samples  can  be  used  to  incrementally  construct  a  decision  tree  that  is  not 
very  different  from  the  one  built  from  the  entire  data.  Each  ly-point  is  a  clustered 
representation  of  a  number  of  records,  which  implies  that  clustering  techniques  can 
be  used  to  select  ■u;-points  from  the  original  dataset.  We  describe  how  samples  can 
be  extracted  in  Section  4.2. 
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4.1.2    ICE:  A  Scalable  Method  for  Incremental  Classification 

Yoon  et  al.  proposed  an  efficient  method  called  ICE  for  incremental  classification 
on  large  datasets  [11],  which  is  described  in  Figure  4.2.  D,  denotes  a  partition  of  the 
entire  dataset  that  is  collected  during  i-th  time  period  {epoch  i).  The  partition  may 
have  been  collected  since  the  last  partition  was  classified,  or  that  it  is  partitioned 
simply  because  the  dataset  is  too  large  to  be  processed  at  once.  A  decision  tree  Tj  is 
built  for  Di,  and  a  set  Si  of  samples  representing  is  extracted  from  Ti.  The  set  Si 
is  combined  together  with  the  previous  sets  of  samples  (denoted  Ui  —  5i  (J  •  •  •  U  •S'i)- 
Ui  is  then  used  as  the  training  set  for  the  eventual  decision  tree  classifier  Ci  at 
epoch  i,  after  which  Ui  is  preserved  for  the  next  epoch.  In  Figure  4.2,  straight  lines 
represents  data  flow  while  dotted  lines  denote  the  application  of  the  chosen  decision 
tree  algorithm. 

The  method  requires  building  two  separate  decision  trees,  Ti  and  then  Ci,  at 
epoch  i.  However,  the  cost  for  running  a  decision  tree  algorithm  on  two  comparably 
small  datasets  {Di  and  Ui)  separately  is  relatively  minor.  The  current  entire  dataset 
(U  A)  is  presumably  very  large  compared  to  Di  and  J7j,  making  construction  of  the 
decision  tree  from  scratch  inefficient  and  impractical.  The  dataset  Ui  is  the  space 
overhead  of  ICE.  This  overhead  saves  a,t  least  one  pass  over  the  entire  dataset,  and 
makes  the  overall  process  very  efficient  and  practical. 

ICE  not  only  is  suitable  for  incremental  classification  when  the  dataset  is  ever- 
growing, but  can  also  be  easily  extended  to  diverse  applications  [11].  The  charac- 
teristics of  ICE  include  size  scalability,  easy  parallelization,  inherent  migration  into 


60 


distributed  environment,  temporal  scalability,  algorithm  independence,  minimal  data 
access,  and  flexibility  in  dealing  with  addition  and  deletion  of  records  as  partition. 

4.2    Sampling  for  Incremental  Classification 

In  this  section,  we  discuss  sampling  techniques  for  incremental  classification  in 
association  with  ICE.  Note  that  there  is  supposedly  no  superior  sampling  method  that 
consistently  outperforms  the  others.  Moreover,  no  classification  method  works  well 
for  all  applications.  The  quality  of  the  resulting  classifier  depends  primarily  upon 
the  characteristics  of  the  dataset.  The  goal  pursued  in  this  research  is  to  identify 
sampling  techniques  that  are  better  suited  for  the  incremental  classification  in  terms 
of  accuracy  and  performance. 

4.2.1    Sampling  from  Decision  Tree 

One  of  the  problems  with  random  sampling  is  that  it  is  sensitive  to  the  distribu- 
tion of  data:  when  the  dataset  is  skewed,  the  randomly  chosen  samples  are  not  likely 
to  be  a  good  representation  of  the  dataset.  Due  to  this,  the  resulting  decision  tree  is 
generated  nondeterministically,  depending  on  what  samples  are  chosen.  We  observe 
that  the  decision  tree  built  on  a  dataset  can  be  used  to  generate  a  good  sample  of 
the  dataset  (Section  4.1.1).  Since  the  skewedness  of  dataset  is  already  reflected  to 
the  tree  itself,  we  can  reliably  generate  good  samples  in  a  more  consistent  manner 
from  the  tree.  Such  a  sampling  method  is  expected  to  result  in  a  new  decision  tree 
with  better  or  comparable  quality  than  random  sampling  does,  let  alone  eflSciency. 
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Figure  4.2.  The  description  of  ICE  over  four  epochs 
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Figure  4.3.  The  hyperbox  representation  of  a  decision  tree  classifier 
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Each  node  in  a  decision  tree  includes  a  set  of  records,  which  in  turn  is  a  subset 
of  the  records  in  its  parent  node.  Therefore,  one  can  view  each  node  of  a  decision 
tree  as  a  hyperbox  in  a  multi-dimensional  hyperspace  composed  of  the  attributes  of 
the  dataset.  Figure  4.3  shows  such  hyperboxes  of  the  decision  tree  derived  from  the 
dataset  in  Figure  2.1(a).  An  area  enclosed  by  dotted  lines  constitutes  a  hyperbox 
for  which  a  leaf  node  represents.  The  leftmost  box  corresponds  to  the  node  B  for 
records  with  age  <  35  and  all  kinds  of  profession.  The  top-right  box  corresponds  to 
the  node  E  for  records  with  age  >  35  and  profession  =  {business,  teacher,  professor} 
,  whereas  the  bottom-right  box  corresponds  to  the  node  D  for  records  with  age  >  35 
and  profession  =  {  clerk  }  .  Note  that  the  node  A  represents  the  entire  dataset, 
and  the  node  C  consists  of  the  union  of  the  records  represented  by  nodes  D  and 
E.  During  construction  phase,  weighted  samples  from  the  nodes  of  the  tree  can  be 
extracted  easily.  If  the  samples  are  to  be  from  the  leaf  nodes,  for  example,  we  extract 
samples  from  nodes  B,  D,  and  E. 

4.2.2    Sampling  Techniques  for  Incremental  Classification 

As  aforementioned,  the  decision  tree  can  be  used  to  extract  good  weighted  sam- 
ples. Due  to  the  weighted  gin!  index  discussed  in  Section  4.1.1,  we  expect  these 
samples  to  show  better  characteristics  than  the  ones  drawn  directly  from  the  dataset. 
Consequently,  these  samples  will  enable  a  better  decision  tree  classifier  to  be  con- 
structed in  an  incremental  fashion.  We  provide  a  few  sampling  techniques  that  can 
be  used  within  ICE  for  incremental  classification  in  this  section.  The  effectiveness  of 
these  techniques  will  be  explored  in  Section  4.3. 
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Random  sampling  Weighted  samples  are  chosen  randomly  from  each  node  of 
the  decision  tree  built  for  the  current  incremental  partition.  We  denote  this  method 
as  ICE-random.  While  this  method  is  simple  and  easy,  the  probability  of  right  samples 
to  be  selected  is  still  low.  Therefore,  the  classifier  built  on  random  samples  may  be 
still  inconsistent  in  terms  of  quality.  The  accuracy  of  the  resulting  classifier,  for 
instance,  may  fluctuate  depending  on  the  samples  chosen. 

An  alternative  to  random  sampling  is  stratified  sampling  by  which  samples  are 
randomly  chosen  based  on  class  labels  in  proportion  to  the  size  of  the  class  in  each 
node  [34].  Choosing  samples  in  a  class  is  random.  This  method  may  somewhat 
alleviate  the  drawback  of  random  sampling.  However,  the  samples  chosen  randomly 
within  the  same  class  still  may  not  be  sufficient  enough  to  represent  the  records  at 
the  node  from  which  the  samples  are  chosen. 

Local  clustering  Clustering  operation  discovers  collections  of  points  in  the  data 
space  so  that  the  points  in  a  cluster  are  closer  to  each  other  than  to  the  points  in 
other  clusters  [60].  Classification  is  somewhat  similar  to  clustering  except  that  it  deals 
with  class  labels  to  find  clusters  of  records  with  the  same  class  label.  If  clustering 
is  applied  such  that  it  identifies  clusters  of  the  points  with  the  same  class  label,  the 
cluster  representatives  may  be  a  good  representation  of  the  dataset  that  can  be  used 
for  incremental  classification. 

Therefore,  we  apply  a  clustering  algorithm  to  the  records  in  each  node  of  an 
intermediate  decision  tree  to  obtain  good  weighted  samples.  Samples  are  the  clus- 
ter centers  found  by  the  clustering  algorithm  with  the  weight  being  the  number  of 
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records  in  the  cluster.  Since  clustering  is  executed  against  the  local  partition  in  each 
node,  this  method  is  called  local  clustering,  and  denoted  ICE-local  when  used  within 
ICE.  The  cost  of  clustering  decreases  dramatically  as  the  number  of  records  becomes 
smaller.  Since  it  is  applied  to  a  supposedly  small  set  of  records  in  each  node,  its 
cost  can  be  deemed  minor  compared  to  the  cost  on  the  entire  dataset.  We  can  also 
adopt  one  of  the  well-known  efficient  clustering  algorithms  such  as  BIRCH  [55]  and 
DBSCAN  [7]. 

Global  clustering  Conversely,  a  clustering  technique  can  be  applied  to  each 
incremental  portion  to  extract  samples  for  incremental  classification.  Samples  are  the 
cluster  centers  found  by  the  clustering  algorithm  with  the  weight  being  the  number  of 
records  in  the  cluster.  The  classes  of  the  records  are  considered  in  clustering  records. 
Since  clustering  is  executed  against  the  entire  dataset,  this  method  is  called  global 
clustering,  and  is  denoted  ICE-global  when  used  within  ICE. 

The  cost  of  this  method  could  be  very  expensive  since  we  assume  a  large  dataset. 
However,  this  method  can  become  an  option  if  incremental  portions  are  small  enough 
to  run  this  method. 

4.2.3    Issues  in  Choosing  Weighted  Samples 

The  datasets  used  in  classification  consist  of  many  attributes,  each  of  which 
has  very  different  characteristics  and  value  domain.  Unlike  scientific  datasets  that 
contain  only  numerical  attributes,  identifying  the  representative  values  from  different 
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domains  is  essential  to  the  outcome  of  the  classification.  In  this  section,  we  discuss 
the  issues  involved  in  selecting  weighted  samples  that  may  bring  maximal  results. 

Attribute  value  When  a  sample  is  chosen,  each  attribute  of  the  sample  is  sup- 
posed to  represent  all  the  values  of  the  corresponding  attribute  in  the  records  for 
which  the  sample  represents.  For  a  numerical  attribute,  the  average  of  the  values 
may  be  chosen,  whereas  the  maximum  or  the  minimum  value  can  also  be  a  choice. 

For  a  categorical  attribute,  the  weighted  gini  index  deals  only  with  the  (numer- 
ical) attributes  in  ordered  domain.  Categorical  attributes  need  to  be  handled  in  a 
different  fashion  since  it  is  difficult  to  define  similarity  between  the  values  of  cate- 
gorical attributes  due  to  lack  of  a  priori  structure.  There  have  been  some  work  on 
categorical  databases  in  the  areas  of  clustering  [61]  and  association  rule  finding  [59], 
although  the  problem  has  yet  to  be  tackled  in  the  area  of  classification. 

In  order  to  have  one  value  to  represent  others  for  a  categorical  attribute,  the 
simplest  method  is  to  take  the  majority:  a  value  with  the  largest  count  is  chosen  to 
represent  the  others.  Another  is  to  create  a  vector  of  values  that  keeps  the  histogram 
information.  Identifying  the  value  with  the  largest  count  is  still  easy,  and  as  dataset 
is  partitioned,  correct  histogram  information  of  the  attributes  can  be  passed  down 
the  decision  tree.  In  this  research,  we  take  the  majority  method. 

Weight  assignment  The  background  for  handling  weighted  records  is  provided 
in  Section  4.1.1.  When  the  samples  are  drawn,  however,  weight  may  or  may  not  be 
assigned  to  the  samples.  The  reasoning  behind  this  is  that  the  values  of  the  attributes 
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in  the  chosen  samples  may  not  have  the  best  values  as  a  representative.  In  some  cases, 
simple  weight  assignment  (i.e.,  weight  is  always  1)  may  result  in  a  better  classifier. 

Small  node  Since  most  datasets  for  classification  is  high-dimensional,  the  hy- 
perspace  composed  of  the  attributes  is  expected  to  be  very  sparse.  A  node  with  a 
small  number  of  records  may  consist  of  outliers.  Due  to  this  reason,  the  nodes  with 
small  number  of  records  may  be  ignored  in  sampling.  An  issue  here  is  to  determine 
what  number  of  records  in  a  node  is  to  be  considered  small. 

Use  of  training  set  for  pruning  When  a  classifier  is  built,  part  of  the  training 
dataset  (preferably  the  most  recent  partition)  can  optionally  be  used  for  pruning  the 
classifier.  The  rationale  behind  this  is  that  the  most  recent  data  partition  may  carry 
more  importance  than  the  old  ones  do. 

4.2.4    Cost  of  Incremental  Classification  Using  ICE 

It  is  obvious  that  sampling  adds  an  extra  cost  to  the  task  of  classification. 
Rather  than  discussing  the  cost  of  sampling  itself,  we  put  more  emphasis  on  the  cost 
for  obtaining  the  final  decision  tree  classifier  using  the  sampling  techniques  presented 
in  the  previous  section. 

Let  T{A,  D)  be  the  cost  function  for  building  a  decision  tree  classifier  for  a 
dataset  D  using  an  algorithm  A.  In  general,  T{A,  D)  takes  the  form  of  |D|  •  /(.4,  D), 
where  /(A,  D)  depends  primarily  on  the  size  of  D.  When  \D\  exceeds  some  threshold 
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so  that  A  must  run  out-of-core,  or  A  involves  expensive  operations  such  as  sorting, 
f{A,D)  will  become  asymptotically  large,  elevating  the  total  cost  higher. 

For  a  constant  r  (sampling  rate)  such  that  0  <  r  <  1,  the  cost  of  ICE  at  epoch 
i  using  an  algorithm  A  is  composed  of  three  components:  building  a  tree  ti  for  Di, 
extracting  samples  from  ti,  and  constructing  the  final  classifier  Ci  for  Ui.  Therefore, 
the  cost  for  ICE  is  defined  as  follows: 

Tice{A,  D)  =  T(.4,  A)  +  r  ■  I  Al  +  T{A,  U,) 

The  ratio  R  of  the  cost  of  ICE  to  the  cost  of  building  a  decision  tree  from  scratch 
is  as  follows: 

T{A,Di)  +  r  ■\D,\+T{A,Ui) 
T{A,D) 

Assume  that  each  partition  Di  is  of  the  same  size,  namely  n.  Then,  at  any 
epoch  i,  \D\  =  i  ■  n  and  \Ui\  =  r  ■  i  •  n.  Assuming  f{A,  D)  is  a  constant,  R  becomes 
approximately  ^-y-^,  which  converges  to  r  for  a  large  i.  Note  that  the  cost  of  extracting 
samples  from  U  {r  ■  |  A|)  can  be  ignored  compared  to  T{A,  D).  In  practice,  however, 
since  <  |^|,  the  functions  f{A,Di)  and  f{A,Ui)  are  likely  much  smaller 

than  f{A,D).  This  implies  that  ^^y^  is  the  upper  bound  of  R.  Therefore,  it  is 
clear  that  as  the  dataset  grows  over  time,  building  a  decision  tree  for  the  entire 
dataset  becomes  more  expensive,  whereas  the  cost  of  running  ICE  remains  very  low, 
regardless  of  the  size  of  the  entire  dataset. 
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4.3    Experimental  Results 

In  the  experiments,  we  used  a  simple  greedy  search  as  the  base  decision  tree  clas- 
sifier algorithm  in  ICE.  At  each  node,  the  algorithm  sorts  all  the  numerical  attributes 
of  the  records  to  calculate  the  gini  index.  The  histogram  of  categorical  attributes  is 
discovered  at  the  root  node  and  passed  down  the  tree  iteratively.  The  attribute  with 
the  lowest  gini  index  is  chosen  to  split  the  records  at  each  node.  Note,  however,  that 
any  decision  tree  algorithm  can  be  used  in  ICE. 

The  experimental  results  of  ICE  are  compared  against  that  of  random  sampling 
(from  the  entire  dataset).  The  sampling  techniques  used  are  ICE-local  and  ICE-global 
as  described  in  Section  4.2.  ICE-random  is  omitted  in  this  dissertation  since  its  result 
is  not  comparably  better  than  random  sampling  from  the  entire  dataset.  In  the 
experiments,  a  simple  A;-means  clustering  algorithm  [60]  was  used  with  the  cluster 
seeds  chosen  randomly.  The  value  A;  varies  depending  on  the  size  of  the  node  and  the 
sampling  rate.  The  experiments  were  performed  on  a  DEC  Alpha  workstation  with 
Digital  Unix  OSF  1  V4.0B  and  128  MB  main  memory. 

Table  4.1  summarizes  the  datasets  used  in  the  experiments.  The  items  included 
are  the  number  of  records  in  the  training  set,  the  number  of  records  in  the  test  set,  the 
number  of  attributes  with  the  number  of  categorical  attributes  in  the  parentheses, 
the  number  of  classes,  the  number  of  epochs,  and  the  size  of  epochs  (all  equal), 
respectively.  The  results  for  chess  dataset  are  omitted  because  they  are  similar  to 
those  of  letter  dataset,  except  that  ICE-global  shows  the  best  characteristics.  More 
complete  experimental  results  can  be  found  in  [11]. 
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The  letter  dataset  The  letter  dataset  contains  records  that  represent  charac- 
teristics of  hand-written  alphabet  letters  (Table  4.1)  [1].  Figures  4.4  to  4.7  show 
the  classification  accuracy  of  each  Ci  after  the  first  epoch  (where  Ti  =  Ci)  for  this 
dataset.  The  straight  line  (denoted  Cumulative)  denotes  the  accuracy  of  the  decision 
tree  for  the  cumulative  dataset  IJ  A  at  epoch  i,  using  the  simple  greedy  search  algo- 
rithm. The  dotted  straight  line  (denoted  Partition  i)  represents  the  accuracy  of  the 
decision  tree  Tj  built  on  the  current  partition  Di.  x%  records  of  Di  are  extracted 
from  Ti  as  samples,  where  x  varies  within  1,  2,  5,  10,  20,  30,  and  50.  The  random 
samples  are  chosen  from  the  cumulative  dataset  U  Di  at  epoch  i,  to  which  the  simple 
greedy  search  method  is  applied  (denoted  Random). 

4.3.1    Experimental  Results  with  Real  Datasets 

Figures  4.4  to  4.7  lead  us  to  conclude  that  ICE-local  consistently  results  in  a 
better  classification  accuracy  at  any  sampling  rate  at  any  epoch.  ICE-global  is  close 
to  random  sampling,  albeit  slightly  better.  One  might  speculate  that  the  impurity  of 
the  samples  increases  as  epochs  evolve  and  samples  are  chosen  repeatedly.  However, 
such  effect  of  cumulating  impurity  seems  either  non-existent  or  very  minor  for  this 


Table  4.1.  Description  of  the  datasets  used  in  the  experiments 


Dataset 

Training 

Test 

Attributes 

Classes 

Epochs 

Epoch  size 

letter 

15,000 

5,000 

16  (0) 

26 

5 

3,000 

chess 

25,000 

3,056 

6(0) 

18 

5 

5,000 

person 

500,000 

10,000 

9(3) 

2/3/4 

5 

100,000 

Synthetic 

1,000,000 

16,000 

15  (0) 

3 

10 

100,000 

70 


80 


70 


60 


S50 
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Figure  4.4.  After  epoch  //  for  letter  dataset 
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Figure  4.5.  After  epoch  ///  for  letter  dataset 
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Letter  data  set:  Epocti  IV 
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Figure  4.6.  After  epoch  IV  for  letter  dataset 
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Figure  4.7.  After  epoch  V  for  letter  dataset 
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dataset.  At  about  15%  sampling  rate  or  higher,  ICE  becomes  better  than  T,  and  gets 
close  to  the  accuracy  of  the  decision  tree  built  on  the  cumulative  dataset. 

Table  4.2  collectively  shows  the  measures  of  decision  tree  quality  for  15%  sam- 
pling rate.  Although  15%  of  the  dataset  cannot  be  regarded  small  enough,  the  pur- 
pose of  this  experiment  is  to  show  our  sampling  techniques  outperform  random  sam- 
pling. Each  number  in  the  table  is  an  average  over  10  executions  and  the  numbers  in 
parentheses  are  standard  deviation  that  shows  how  the  results  vary  on  different  runs 
for  the  same  dataset.  Other  parameters  remain  unchanged. 


Table  4.2.  The  quality  measures  for  letter  dataset  at  15%  sampling  rate 


Measure 

Method 

Epoch  2 

Epoch  3 

Epoch  4 

Epoch  5 

Accuracy 
(%) 

Cumulative 

75.1 

78.8 

81.1 

83.4 

Random 

55.4  (0.99) 

60.3  (1.12) 

63.4  (0.90) 

66.2  (0.99) 

ICE-local 

62.8  (1.09) 

65.1  (0.95) 

66.8  (0.76) 

69.1  (0.50) 

ICE-global 

60.5 

63.2 

63.7 

66.6 

Number  of 
leaf  nodes 

Cumulative 

1077 

1358 

1621 

1849 

Random 

278  (12.47) 

370  (5.59) 

453  (12.15) 

527  (17.92) 

ICE-local 

173  (5.59) 

224  (4.34) 

274  (5.86) 

390  (6.24) 

ICE-global 

465 

615 

783 

927 

Execution 

time 
(seconds) 

Cumulative 

3.87 

6.02 

8.08 

10.37 

Random 

0.50 

0.78 

1.06 

1.29 

ICE-local 

2.26 

2.46 

2.77 

2.95 

ICE-global 

0.57 

0.83 

1.12 

1.42 

In  terms  of  accuracy,  it  is  obvious  that  ICE-local  consistently  results  in  better 
decision  trees  than  random  sampling  does.  Note  that  the  standard  deviation  in  the 
table  reveals  how  the  results  vary  in  different  runs.  In  terms  of  tree  size,  it  also 
indicates  that  the  tree  generated  by  ICE-local  is  more  compact  than  that  by  random 
sampling,  hence  making  it  more  comprehensive.  The  results  also  conform  to  the  cost 
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model  given  in  Section  4.2.4.  With  more  realistically  large  datasets,  the  speedup  will 
be  more  evident. 

The  chess  dataset  The  chess  dataset  contains  records  that  represent  three  lo- 
cations on  the  chessboard  for  three  chessmen  at  some  point  of  time  (Table  4.1)  [1]. 

The  experimental  circumstances  are  the  same  as  those  for  letter  dataset  except 
that  sampling  rates  are  5%,  10%,  20%,  30%,  and  50%.  Figure  4.8  displays  only  the 
results  of  epoch  3  and  5  because  other  epochs  produces  similar  results.  Again,  it 
shows  our  sampling  techniques  are  better  than  random  sampling.  However,  it  turns 
out  that  ICE-global  is  a  much  better  option  than  ICE-local  for  this  dataset.  The  other 
quality  measures  are  not  shown  because  the  results  are  similar  to  Table  4.2  for  letter 
dataset. 

As  shown  in  Figures  4.4  to  4.8,  ICE  starts  to  pick  up  the  accuracy  of  the  cumula- 
tive dataset  with  about  15%  sampling  rate  or  higher.  The  two  datasets  on  which  the 
experiments  were  conducted  contain  15,000  records  and  25,000  records  each.  When 
the  framework  is  applied  to  a  larger  dataset  with  millions  of  records  or  more,  we 
expect  the  least  sampling  rate  for  an  optimal  result  to  be  much  smaller. 

4.3.2    Experimental  Results  with  Synthetic  Datasets 

In  order  to  study  the  applicability  of  ICE  to  large  datasets,  we  created  synthetic 
datasets  using  the  data  generator  [45,  19]  that  is  available  from  the  IBM  Quest  home 


Chess  data  set;  Epoch  I 


20  30 
Sampling  rate  (%) 


(a)  Epoch  /// 


Chess  data  set:  Epoch  V 


70  • 


fSSO- 


i  40  - 


30 


20 


Cumulative 
Partition  5 
Local 
Global 
Random 


10 


30 

Sampling  rate  (%) 


50 
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Figure  4.8.  Experimental  results  for  chess  dataset  with  five  epochs 
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page^.  The  class  distributions  are  0.55/0.45,  0.4/0.3/0.3,  and  0.3/0.3/0.2/0.2,  for 
2-class,  3-class,  and  4-class  datasets,  respectively  (Table  4.3). 

Different  data  distributions  are  generated  by  using  distinct  classification  func- 
tions to  assign  class  labels  to  class  values.  Further  details  on  these  predicates  can  be 
found  in  the  IBM  Quest  home  page  [45].  In  the  experiments,  5%  perturbation  factor 
was  used  and  10%  noise  was  created  in  each  dataset  [45].  Only  ICE-local  is  compared 
to  random  sampling  since  running  ICE-global  on  a  large  dataset  (100,000  records  in 
this  case)  is  expensive,  although  the  resultant  accuracy  could  be  better. 

Table  4.3.  Description  of  attributes  in  synthetic  datasets 


Attribute 

Description 

Value 

salary 

Salary 

Uniformly  distributed  from  20,000  to  150,000 

commission 

Commission 

If  salary  >  75,000  then  commission  is  zero 
else  uniformly  between  10,000  and  75,000 

age 

Age 

Uniformly  distributed  from  20  to  80 

edJevel 

Education  level 

Uniformly  chosen  from  0  to  4 

car 

Make  of  the  car 

Uniformly  chosen  from  1  to  20 

zipcode 

Zip  code 

Uniformly  chosen  from  available  zipcodes 

hvalue 

House  value 

Uniformly  distributed  from  0.5  •  A;  •  100,  000 
to  1.5- k  -  100,000 

where  k  E  {0, . . .  ,9}  depending  on  zipcode 

hears 

Years  of  house  owned 

Uniformly  distributed  from  1  to  30 

loan 

Total  loan 

Uniformly  distributed  from  0  to  500,000 

Table  4.4.  Description  of  predicate  functions  used  to  generate  synthetic  datasets 


Function 

Fl 

F3 

F31 

F33 

F41 

F47 

Attributes 

2 

3 

2 

3 

2 

5 

Classes 

2 

2 

3 

3 

4 

4 

The  URL  for  the  page  is  http://www.almaden.ibm.com/cs/quest/demos.html. 
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Figures  4.9  to  4.14  present  the  experimental  results  for  synthetic  datasets  with 


different  numbers  of  classes:  2,  3,  and  4.  Each  figure  is  entitled  with  the  predicate  | 
function,  the  number  of  attributes,  and  the  number  of  classes  involved.  The  dotted  ' 


straight  line  (denoted  EntireSet)  on  top  of  each  figure  is  the  accuracy  of  the  simple 


greedy  search  algorithm  that  is  applied  to  the  entire  dataset.  The  sampling  rate 


varies  within  0.5%,  2.3%,  and  6%,  and  other  parameters  are  similar  to  those  applied 


to  letter  dataset  in  the  previous  section. 


Synthetic  data  set:  500000  records.  5  epochs,  2  classes,  function  1  (2  attributes) 
86 1  1  1  1  [ — 


84  - 


Epoch 

Figure  4.9.  Experimental  results  for  synthetic  dataset  (2  classes,  function  1) 


At  each  epoch,  the  first  two  bars  represent  the  accuracies  of  ICE-local  and  random 
sampling,  respectively,  at  6%  sampling  rate.  Similarly,  the  next  two  pairs  of  bars  show 
the  results  of  sampling  rate  2.3%  and  0.5%.  The  distance  from  the  top  of  each  bar 
to  the  dotted  straight  line  is  the  drift  from  the  accuracy  achieved  from  the  entire 
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Synthetic  data  set:  500000  records,  5  epochs,  2  classes,  function  3  (3  attributes) 


Epoch 


EntireSet 
Local  6% 
Rand  6% 
Local  2.3% 
Rand  2.3% 
Local  0.5% 
Rand  0.5% 


Figure  4.10.  Experimental  results  for  synthetic  dataset  (2  classes,  function  3) 


Synthetic  data  set:  500000  records,  5  epochs,  3  classes,  function  31  (2  attributes) 


Epoch 


Figure  4.11.  Experimental  results 


for  synthetic  dataset  (3  classes,  function  31) 
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Figure  4.13.  Experimental  results 


for  synthetic  dataset  (4  classes,  function  41) 
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dataset.  Other  quality  measures  are  omitted  intentionally  because  they  are  very 
similar  to  those  presented  in  Table  4.2. 


Synthetic  data  set;  500000  records,  5  epochs.  4  classes,  function  47  (5  attributes) 


60 


2^55 


n  40 

O 


35 


30 


20 


Epoch 


Figure  4.14.  Experimental  results  for  synthetic  dataset  (4  classes,  function  47) 


The  results  indicate  that  ICE-local  is  better  than  random  sampling  in  most  cases, 
especially  at  lower  sampling  rates  such  as  0.5%.  The  results  at  2.3%  sampling  rate 
show  that  ICE-local  can  achieve  the  original  accuracy  of  the  entire  dataset.  It  seems 
that  ICE  always  produces  consistent  results  at  any  epoch  regardless  of  sampling  rate. 
On  the  other  hand,  random  sampling  looks  very  unpredictable  at  lower  sampling 
rates,  supposedly  due  to  its  sensitivity  to  data  distribution.  Such  sensitivity  will 
consequently  result  in  decision  trees  of  inferior  quality.  ICE-local  can  result  in  accuracy 
that  is  close  to  the  dotted  line  even  at  a  very  low  sampling  rate.  This  suggests  that 
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ICE-local  is  capable  of  handling  large  datasets  while  generating  comparably  accurate 
decision  tree  classifiers. 

Consequently,  the  experiments  suggest  that  ICE-local  be  a  choice  for  classifying 
large  datasets  incrementally,  say,  with  less  than  only  1%  samples  of  the  dataset.  Since 
the  cost  at  early  epoch  is  relatively  lower,  one  may  run  a  classification  algorithm  on 
the  entire  dataset  at  the  moment.  At  later  epochs  where  the  cost  becomes  expensive, 
ICE-local  can  be  used  to  incrementally  classify  large  datasets  that  grow  in  size  over 
time. 

4.3.3    Experimental  Results  with  Large  Synthetic  Datasets 

ICE  is  tested  for  larger  datasets  with  more  complicated  rulesets  in  this  section. 
The  datasets  used  for  experiments  are  synthetically  created  using  the  data  genera- 
tor [54]  that  is  presented  in  Section  3.3,  where  further  description  of  the  data  gener- 
ator can  be  found.  The  configurations  of  the  datasets  are  summarized  in  Table  4.5. 
The  number  within  the  parentheses  after  each  dimension  represents  the  width  of  the 
interval  within  which  the  corresponding  cluster  was  generated  with  regard  to  the  di- 
mension being  described.  Since  the  range  of  values  is  set  to  [0,100]  for  all  dimensions, 
the  width  indicates  the  density  of  the  points  in  the  cluster  along  the  dimension.  The 
Points  column  shows  the  number  of  points  (i.e.,  records)  in  the  corresponding  cluster. 
Each  dataset  consists  of  3  classes  and  15  attributes  with  9  attributes  used  to  generate 
15  clusters.  In  other  words,  each  dataset  includes  a  ruleset  with  15  rules.  Note  that, 
unlike  the  person  dataset,  the  rules  of  the  dataset  are  made  of  different  attributes. 
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The  experiments  were  conducted  on  a  Sun  Ultra  Enterprise  10000  with  So- 
laris OS  Version  5.6,  8  processors,  and  2  GB  main  memory.  The  sizes  of  Dataset  1 
and  Dataset  2  are  approximately  72  MB  (1  million  records)  and  144  MB  (2  million 
records),  respectively.  When  the  benchmarking  algorithm  (a  simple  greedy  search) 
was  applied  to  Dataset  1,  we  had  to  give  up  collecting  the  result  because  the  algo- 
rithm was  still  executing  after  36  hours  even  with  such  a  powerful  machine.  Instead, 
a  smaller  dataset  (100,000  records)  with  the  same  configuration  as  in  Dataset  1  was 
created  and  used  to  approximate  the  accuracy  of  the  dataset  for  comparison  purpose. 


Table  4.5.  The  description  of  the  synthetic  datasets 


Cluster  ID  & 
Class  ID 

Dataset  1 

Dataset  2 

Dimensions 

Points 

Dimensions 

Points 

0/0 

2(20),  3(20),  7(11) 

70,000 

2(26),  4(20),  9(21) 

140,000 

1/1 

2(20),  3(20),  7(10) 

70,000 

2(25),  4(30),  9(20) 

140,000 

2/2 

2(30),  3(20),  7(70) 

60,000 

2(20),  4(20),  9(40) 

120,000 

3/0 

3(15),  4(20) 

60,000 

2(20),  7(10) 

120,000 

4/1 

3(31),  4(10) 

60,000 

2(45),  7(30) 

120,000 

5/2 

3(45),  4(30),  6(51) 

80,000 

2(30),  7(30) 

160,000 

6/0 

2(10),  7(20) 

60,000 

1(30),  5(31),  8(20) 

120,000 

7/1 

2(10),  7(10) 

70,000 

1(30),  5(40),  8(31) 

140,000 

8/2 

2(40),  7(30) 

70,000 

1(20),  5(21) 

140,000 

9/0 

2(15),  4(30),  9(41) 

80,000 

2(30),  3(20),  7(11) 

160,000 

10/1 

2(20),  4(20),  9(15) 

60,000 

2(20),  3(20),  7(10) 

120,000 

11/2 

2(30),  4(30),  9(10) 

60,000 

2(30),  3(31),  7(70) 

120,000 

12/0 

1(51),  5(51),  8(51) 

30,000 

3(25),  4(10) 

60,000 

13/1 

1(51),  5(50),  8(41) 

30,000 

3(21),  4(20) 

60,000 

14/2 

1(51),  5(51) 

40,000 

3(55),  4(30),  6(41) 

80,000 

The  experimental  results  for  Dataset  1  are  plotted  in  Figure  4.15.  The  results 
for  Dataset  2  are  similar  to  those  for  Dataset  1  and  hence  are  omitted  intentionally. 
The  dataset  is  divided  into  10  equal-sized  partitions  (i.e.,  epochs),  each  with  100,000 


82 


records.  The  straight  Hne  on  the  top  denoted  Estimate  is  the  estimated  accuracy  using 
a  smaller  dataset  with  the  same  configuration  of  Dataset  1.  The  dotted  line  denoted 
Partition  indicates  the  accuracy  of  each  partition  by  the  benchmarking  algorithm,  and 
each  of  the  lines  denoted  ICE  x%  shows  the  accuracy  of  ICE  for  x%  sampling  rate. 
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Synthetic  data  set:  1  million  records  (K1000-A15-C3-S9-P15-N10) 
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Figure  4.15.  Experimental  results  for  synthetic  dataset  (1  million  records,  3  classes) 


The  plots  shown  in  the  figure  bear  important  implications  of  ICE.  First  of  all, 
ICE  can  be  applied  to  large  datasets  that  cannot  otherwise  be  directly  used  to  con- 
struct classifiers  from.  The  total  execution  time  of  ICE  for  Dataset  1  is  approximately 
90  minutes  with  a  slight  storage  overhead  for  samples  and  additional  time  to  build 
a  decision  tree  for  each  partition.  The  maximum  memory  required  is  about  25  MB. 
Such  overhead  can  be  well  justified  by  the  achieved  accuracy  and  the  overall  execu- 
tion time  because  the  cost  for  building  a  classifier  from  the  entire  dataset  is  much 
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more  expensive  than  the  overhead.  Secondly,  the  way  the  partitions  are  handled  by 
ICE  also  hides  the  details  of  computing.  It  means  that  ICE  can  run  on  any  of  various 
computing  configurations:  a  single  processor  (serial  computing),  a  processor  cluster 
(parallel  computing),  a  network  cluster  (distributed  computing),  or  any  combination 
of  them.  Thirdly,  it  suggests  how  ICE  should  be  utilized  for  dynamically  growing 
large  datasets.  The  figure  shows  that  the  accuracy  results  of  ICE  up  to  the  epoch 
6  are  comparably  lower  than  the  overall  estimate.  Note,  however,  that  the  current 
data  size  at  an  early  epoch  is  relatively  small.  Since  the  entire  dataset  is  small,  it 
would  make  sense  to  build  a  classifier  using  the  entire  dataset  at  early  epochs  while 
still  generating  samples  for  later  epochs.  Such  temporary  hybrid  approach  would 
result  in  more  accurate  classifiers  at  early  epochs.  When  the  accuracy  converges  to  a 
satisfactory  level  after  a  few  early  epochs,  ICE  is  applied  alone  to  quickly  construct 
a  prototype  classifier  (which  will  be  used  until  the  next  epoch).  Finally,  the  results 
imply  that  larger  sample  size  may  improve  the  overall  accuracy  results  of  ICE.  How- 
ever, sampling  rate  need  not  be  significantly  high.  As  is  shown  by  the  experiments 
in  Section  4.3.2,  only  a  marginal  portion  of  the  entire  data  is  required  to  build  a 
classifier  of  comparable  accuracy  in  a  very  short  time. 

The  experimental  results  suggest  that  ICE  combined  with  the  proposed  tree- 
based  sampling  techniques  outperforms,  or  at  least  is  comparable  to,  random  sam- 
pling, for  efficient  incremental  classification.  Note  that  random  sampling  (including 
stratified  sampling)  is  simple  and  less  expensive  while  it  is  capable  of  generating 
classifiers  with  comparable  accuracy.  However,  random  sampling  assumes  uniform 
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distribution  of  tlie  data  and  thus  does  not  perform  well  for  datasets  with  skewed 
distribution.  In  addition,  the  outcome  of  random  sampling  is  nondeterministic  and 
may  be  unreliable.  It  may  require  several  runs  from  which  to  choose  the  best  one. 
ICE  can  be  adopted  as  a  compromise  between  the  two  extremes. 

4.4  Summary 

In  this  chapter,  we  introduced  an  efficient  method  called  ICE  for  incremental 
classification  that  employs  tree-based  sampling  techniques.  The  sampling  techniques 
are  independent  of  data  distribution  and  can  substitute  random  sampling  as  a  viable 
alternative  to  building  decision  tree  classifiers  for  large  datasets  in  an  incremental 
fashion.  We  provide  mathematical  background  for  the  sampling  techniques  based  on 
weighted  samples.  The  weighted  samples  are  extracted  from  intermediate  decision 
trees  to  build  an  up  to  date  classifier  incrementally. 

We  conducted  experiments  on  real  datasets  for  effectiveness  of  ICE  and  on  syn- 
thetic datasets  for  scalability.  The  experimental  results  suggest  that  ICE  combined 
with  the  proposed  tree-based  sampling  techniques  outperforms,  or  at  least  is  com- 
parable to,  random  sampling,  for  efficient  incremental  classification.  The  method 
promises  to  be  a  basis  of  incremental  classification  for  large,  ever-growing  datasets 
where  fast  development  of  decision  tree  classifier  is  needed. 

Finally,  we  want  to  point  out  a  few  issues  that  need  to  be  dealt  with  in  the 
future: 


85 


An  efficient  method  that  discovers  an  appropriate  sampling  technique  for  a 
given  dataset,  and  new  sampling  techniques  that  adopt  incremental  clustering 
techniques  [62]. 

How  to  assign  weights  to  categorical  attributes. 

The  time  that  the  record  was  created  can  be  considered  in  updating  the  deci- 
sion tree  classifier.  We  need  a  new  weight  assignment  method  based  on  time 
parameter. 

New  attributes  that  did  not  exist  in  the  previous  dataset,  but  that  are  added  in 
the  new  incremental  dataset  must  be  handled  properly.  The  effects  of  the  new 
attributes  to  the  updated  classifier  have  to  be  identified. 

The  feature  selection  information  from  the  current  decision  tree  classifier  can 
be  used  to  optimize  the  induction  of  new  decision  tree  classifier  when  a  new 
partition  is  added. 

Using  the  tree-based  sampling  method  to  generate  a  good  sample  for  other  more 
expensive  mining  methods  such  as  neural  network. 
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CHAPTER  5 

DATA  ACCESS  MODELS  FOR  EFFICIENT  DATA  MINING 

5.1  Introduction 

High  Energy  Physics  (HEP)  particle  experiments  deal  with  the  interactions  be- 
tween a  target  and  colliding  beam  of  elementary  particles.  CLEO  [63],  one  such 
example  of  an  HEP  experiment  at  Cornell  University,  produces  tens  of  thousands 
of  events  (i.e.,  experimental  observations)  a  day,  which  are  then  reprocessed  and 
archived  into  tape  media.  The  resultant  read-only  data  is  comprised  of  event  objects 
of  varying  sizes.  CLEO  currently  maintains  about  10  TB  of  physics  data  which  will 
grow  to  over  100  TB  in  the  next  few  years. 

Nile  [64],  a  multi-disciplinary  collaboration  of  physics  and  computer  scientists, 
is  developing  a  distributed  computing  solution  for  the  CLEO  and  other  HEP  collabo- 
rations. The  goal  is  to  provide  a  self-managing,  fault-tolerant,  heterogeneous  system 
of  hundreds  of  commodity  workstations,  with  access  to  a  distributed  database  in 
excess  of  100  TB.  These  resources  are  spread  across  the  United  States  and  Canada 
at  24  collaborating  institutions.  One  aspect  of  Nile  is  to  construct  an  efficient  data 
delivery  system  that  is  consistent  with  the  physicist's  use  patterns  and  the  goals  of 
Nile.  The  goal  of  the  data  server  presented  in  this  research  as  a  part  of  the  Nile  ] 

i 

86 


87 


project  is  to  obtain  very  high  performance  in  accessing  a  very  large  data  set  for  the 
physics  analysis  jobs.  In  addition,  the  data  server  must  be  flexible  enough  to  support 
clients  from  different  distributed  computing  environments. 

HEP  data  has  characteristics  that  facilitate  high  performance  data  access.  Events 
are  independent  and  the  access  order  is  not  important.  This  event  independence 
means  that  each  event  can  be  independently  managed  and  read  for  analyses,  pro- 
vided that  each  event  is  guaranteed  to  be  read  for  each  analysis  only  once.  This 
allows  event-level  parallelism  and  a  lot  of  flexibility  in  data  distribution.  HEP  data 
is  mostly  read,  only  rarely  updated.  This  makes  expensive  DBMS  operations  for 
concurrency  control  and  reliability  unnecessary.  Thus,  data  access  performance  is 
the  most  important  requirement. 

What  distinguishes  ASWAN  (advanced  server  for  swift  access  to  Nile)  from  other 
conventional  general  purpose  DBMS's,  and  object  stores  (e.g.,  Ptool  [65])  is  its  design 
approach  in  which  the  data  server  is  optimized  for  typical  operations  in  HEP  data 
analysis  and  designed  to  be  easily  integrated  into  a  parallel  computing  system  [66]. 
ASWAN  is  similar  in  many  ways  to  recently  proposed  special-purpose  file  systems 
(e.g.,  ELFS  [67,  68]  and  disk-directed  I/O  [69]).  However,  the  server  does  not  replace 
existing  file  systems,  instead  it  is  designed  to  sit  on  top  of  existing  data  sources  (file 
systems,  tape  robots,  DBMSs,  etc.).  In  addition,  ASWAN  is  purposefully  designed 
to  be  simple  to  avoid  delays  on  the  critical  path  to  data  access. 

Although  the  data  server  has  been  developed  with  a  particular  application 
(HEP)  in  mind,  it  can  be  quite  an  appropriate  tool  for  any  application  where  the 
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primary  activity  is  to  collect  aggregate  statistics.  The  data  server  may  be  used  for 
data  mining  purposes  especially  in  the  area  of  classification  [19]  and  clustering  [7,  55]. 

5.2  Requirements 

A  conflict  always  exists  between  what  users  need  for  their  job  and  what  the 
system  can  deliver  as  well  as  how  the  system  can  meet  the  user's  need.  The  require- 
ments for  a  data  server  may  be  identified  from  the  perspectives  of  both  clients  and 
the  server. 

The  foremost  among  the  user  requirements  may  be  the  flexibility  and  easiness 
of  the  client  interface  to  the  data  server.  The  client  interface  to  the  server  must  be 
simple,  uniform,  easy  to  understand  and  must  support  standard  database  operations 
like  selection,  projection,  and  so  forth.  The  implementation  of  the  data  server  must 
be  transparent  to  the  client.  Client  should  be  able  to  access  both  local  and  remote 
data  in  a  uniform  way  {access  transparency).  In  other  words,  the  physical  distribution 
of  data  should  be  concealed  as  much  as  possible.  A  client  should  not  be  deteriorated 
by  the  existence  of  other  clients  concurrently  executing  in  the  system  {concurrency 
transparency).  The  data  server  must  be  able  to  consistently  deliver  the  data  set 
to  the  client  without  considerable  degradation  in  performance.  Another  issue  with 
multiple  clients  is  that  the  data  server  must  scale  well  and  gracefully  as  the  number  of 
concurrent  clients  increases.  Fault-tolerance  and  availability  are  other  very  important 
requirements  from  the  client's  perspective. 

One  of  the  most  important  system  requirements  that  ensures  efficient  use  of 
the  system  resources  is  the  performance.  Performance  may  be  defined  in  terms  of 
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both  data  access  rate  and  throughput.  Resources  must  be  efficiently  utilized  so  as 
to  maximize  the  throughput  while  ensuring  a  high  data  access  rate  to  each  client. 
The  data  server  must  be  flexible  enough  to  accommodate  changes  in  technology  and 
system  structure.  In  order  to  achieve  this,  the  data  server  is  required  to  provide  an 
abstraction  of  the  underlying  system.  The  data  server  must  be  scalable  in  terms  of 
the  number  of  storage  devices  to  which  it  is  attached  as  well  as  the  number  of  clients. 
Another  important  aspect  is  portability.  The  server  has  to  smoothly  migrate  from  one 
Nile  collaboration  site  to  another.  Today's  networking  systems  connect  hundreds  of 
heterogeneous  hardwares  and  softwares.  The  same  design  needs  to  be  adopted  to  any 
kind  of  systems  in  use  without  modification.  In  addition,  the  data  server  design  must 
be  modular  so  that  the  server  can  be  easily  maintained  and  remain  cost  effective. 

5.3    Design  and  Implementation 

5.3.1    Design  Goals 

All  the  requirements  described  in  Section  5.2  have  been  considered  in  the  de- 
sign of  ASWAN  except  for  availability  which  is  yet  to  be  addressed.  We  would  not 
elaborate  to  repeat  the  design  goals  that  are  already  presented  in  the  previous  sec- 
tion. Instead,  we  will  deal  with  other  important  issues  in  implementation  in  the 
Section  5.3.4. 
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5.3.2    Data  Server  Architecture 

In  this  section  we  present  a  high  level  description  of  the  data  server  architecture. 
It  may  be  noted  that  a  clean  and  uniform  interface  is  essential  for  easy  and  efficient 
access  to  the  data  server.  Also,  flexibility  and  modularity  require  that  the  data  server 
be  split  into  multiple  processes  or  threads.  Thus,  there  is  a  need  for  clear,  well- 
defined,  logical  and  physical  abstractions.  The  data  server  should  not  be  overloaded 
to  do  functions  like  name  management,  locating  data  sources,  etc.  Instead,  it  is 
desirable  to  have  a  light  data  server,  totally  dedicated  to  data  access.  The  manager 
process  (MP)  and  metadata  server  (MDB)  have  been  introduced  to  establish  this 
goal  (Figure  5.1). 


attribute  registration 


Figure  5.1.  Data  server  architecture 

In  the  absence  of  a  manager  process,  clients  need  to  keep  track  of  a  lot  of 
information  with  regard  to  the  location  of  servers,  port  numbers,  physical  location  of 
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data  and  so  forth.  This  is  certainly  least  desirable  from  the  point  of  view  of  both  the 
client  and  the  server.  Thus,  the  availability  of  manager  process  greatly  simplifies  the 
job  for  clients  to  access  data  from  remote  locations.  All  that  a  client  needs  to  know 
is  how  to  contact  the  manager  process  through  a  well-defined  client  interface.  The 
metadatabase  server  (MDB)  keeps  meta-information  such  as  the  location  of  data, 
data  server  configurations,  port  numbers,  physical  names  of  data  files  and  index  files 
and  so  forth. 

5.3.3  Implementation 

To  achieve  our  goal  of  flexibility,  a  modular  client-server  design  has  been  adopted 
(Figure  5.2).  The  data  server  is  split  into  two  parts:  the  front-end  server  (FS) 
and  back-end  server  (BS).  The  back-end  server  hides  the  low-level  details  of  storage 
system  implementation  and  provides  a  uniform  way  to  access  data,  no  matter  how 
it  is  actually  stored.  The  front-end  server  provides  an  interface  between  clients  and 
the  back-end  servers.  Consequently,  the  front-end  server  provides  the  clients  with  a 
logical  abstraction  of  the  data  server  that  is  independent  of  the  underlying  storage 
system  while  the  back-end  server  is  a  physical  abstraction  of  the  storage  system. 

In  order  to  keep  the  client  interface  uniform,  it  is  required  that  a  front-end  server 
be  assigned  to  a  client  and  never  serve  more  than  a  client  at  a  time.  Within  the  data 
server,  however,  any  combination  of  front-end  and  back-end  servers  is  possible.  A 
front-end  server  may  be  assigned  with  multiple  back-end  servers  (Figure  5.2(b))  or,  a 
back-end  server  may  serve  multiple  front-end  servers  (Figure  5.2(c)).  Note  that  the 
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configuration  of  data  server  is  transparent  to  the  client  because  it  always  deals  only 
with  the  front-end  server  to  which  it  is  assigned. 


(  client  ) 


Disk 

(a)  Base  configuration     (b)  Parallel  access  (c)  Loaded  server 

Figure  5.2.  Two-tier  design  of  the  data  server 


Figure  5.2(a)  describes  the  basic  configuration  of  the  data  server.  A  pair  of 
front-end  and  back-end  servers  handles  the  requests  from  a  client.  It  describes  how 
asynchronous  access  to  data  set  is  achieved.  In  Figure  5.2(b),  the  data  server  is 
configured  to  take  full  advantage  of  having  two  different  disks  in  the  system.  It 
describes  how  parallel  access  to  data  set  is  achieved.  However,  it  must  be  noted  that 
a  back-end  server  can  inherently  handle  multiple  disks  if  the  physical  file  names  it 
receives  reside  in  different  disks.  This  means  that  the  back-end  server  can  handle  even 
NFS  file  access  transparently.  Figure  5.2(c)  depicts  the  data  server  that  is  loaded 
with  multiple  clients  in  an  extreme  case  with  only  one  back-end  server.  In  fact,  the 
data  server  can  serve  any  number  of  clients  in  any  different  form  of  configuration 
described  in  Figure  5.2. 
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Front-end  server  A  client  requests  a  data  stream  by  contacting  the  manager 
process  (MP)  and  providing  the  logical  name  of  the  data  set  desired,  the  selection  of 
vertical  partitions  used  to  reconstruct  event  objects  (vertical  partitioning)  and  other 
similar  information.  The  manager  process  in  turn  probes  the  metadatabase  server 
(MDB)  and  obtains  the  physical  names  of  files  to  be  accessed,  their  location,  contact 
information  of  back-end  server(s)  assigned,  and  so  forth.  A  part  of  this  information 
is  then  passed  on  to  the  back-end  server (s)  chosen  to  retrieve  the  specified  data  set. 

The  front-end  server,  upon  receiving  the  request  from  the  manager  process, 
opens  a  session  with  the  corresponding  back-end  server (s).  Then  it  starts  recon- 
structing the  event  objects  with  the  data  stream  provided  by  the  back-end  server(s). 
The  reconstructed  event  objects  are  shipped  to  the  client  through  a  channel  dedicated 
by  the  MP. 

The  front-end  server  has  only  a  limited  knowledge  of  the  nature  of  the  event 
objects  that  are  shipped  to  the  clients.  Attribute  projections  can  only  be  performed 
by  the  client  by  specifying  the  vertical  partitions  to  be  accessed,  and  specific  event 
objects  selection  can  only  be  performed  by  providing  an  apriori  list  of  event  objects 
to  be  selected  (i.e.,  by  providing  a  bitmap  index  [70]). 

One  disadvantage  of  the  front-end  server's  simplicity  is  that  it  cannot  take  ad- 
vantage of  all  possible  optimizations  at  the  server  side  (i.e.,  aggressive  selections  and 
projections).  However,  the  front-end  server  is  just  smart  enough  to  meet  most  of  the 
design  requirements.  The  ex  experimental  results  section  shows  that  the  front-end 
server  provides  a  high  throughput,  and  the  selections  and  projections  it  performs 
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can  substantially  improve  performance.  Requiring  the  front-end  server  to  actually 
interpret  the  data  that  it  serves  slows  down  a  component  on  the  critical  path  to  the 
data,  and  usually  degrades  performance. 

The  client  receives  events  as  an  uninterpreted  block  of  bytes.  A  separate  software 
component  interprets  the  byte  block  into  a  structure  or  object  that  the  data  analysis 
software  finds  convenient.  Other  Nile  collaborators  are  developing  Scaffold,  which 
constructs  a  C-I-+  object  that  is  convenient  for  HEP  analysis  [71].  Scaffold  is  designed 
to  translate  analysis  job  requests  for  event  object  attributes  into  requests  for  vertical 
partitions  from  the  data  server. 

Back-end  server  The  back-end  server  of  ASWAN  is  a  simple,  high-speed  data 
provider.  It  accesses  the  data  files  upon  requests  and  supplies  a  stream  of  bytes  to  the 
front-end  server  without  trying  to  understand  or  interpret  the  data  being  accessed. 
The  back-end  server  aggressively  prefetches  data  (whenever  possible)  by  attempting 
to  keep  full  the  buffers  shared  with  the  front-end  server (s).  The  simplicity  of  the 
back-end  server  ensures  that  the  critical  path  to  the  data  access  is  not  slowed  down. 

One  obvious  advantage  of  the  two-tier  approach  is  the  ability  to  dynamically 
reconfigure  the  storage  system  based  on  the  available  resources  in  the  target  system. 
For  example,  we  can  exploit  parallel  I/O  at  a  host  with  multiple  disk  drives  by 
assigning  a  back-end  server  to  each  drive.  We  note  that  this  I/O  parallelism  is 
in  addition  to  the  parallelism  provided  by  the  distributed  design  of  the  Nile  data 
subsystem  architecture.  Another  advantage  is  that  this  approach  can  employ  any 
kind  of  storage  system;  robotic  tape  libraries  or  even  a  commercial  DBMS  product 
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can  be  accommodated  without  changing  the  higher  level  interfaces  (i.e.,  the  front-end 
server  and  clients). 

5.3.4    Issues  in  Implementation 

This  section  describes  the  important  issues  in  implementing  the  data  server 
that  are  not  discussed  in  the  previous  sections.  The  issues  described  below  have  been 
implemented  and  become  the  distinguished  features  of  ASWAN. 

Partitioning  The  foremost  goal  of  the  data  server  in  HEP  application  is  to 
provide  the  clients  with  a  stream  of  data  based  on  the  clients'  needs  with  as  high  a 
data  access  rate  as  possible.  Since  not  all  of  the  data  is  necessary  for  a  particular 
analysis,  the  data  server  must  allow  the  client  to  specify  the  projection  of  attributes 
{vertical  partitioning)  and  a  varying  fraction  of  actual  events  {horizontal  partitioning 
or  indexing). 

Asynchronous  access  Asynchronous  access  improves  data  access  performance 
and  hides  disk  access  latency.  The  goal  is  to  overlap  the  disk  access  time  at  the 
back-end  server  and  the  time  for  the  event  reconstruction  and  delivery  at  the  front- 
end  server.  The  performance  can  be  further  enhanced  by  parallel  I/O  operations 
to  multiple  disks  such  that  multiple  data  blocks  on  different  disks  are  accessed  in 
parallel. 
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Prefetching  The  data  server  needs  to  aggressively  prefetch  the  data  to  maxi- 
mize the  performance  (or  to  minimize  the  disk  access  latency).  Most  scientific  anal- 
yses are  I/O  bound,  and  it  becomes  more  and  more  important  to  reduce  the  speed 
gap  between  the  processor  and  the  storage  devices  for  such  processes.  With  the 
prefetching  mode  enabled,  the  back-end  server  keeps  reading  blocks  of  data  from  the 
disk  until  the  end  of  the  data  stream  under  request  (usually  the  end  of  file).  The 
front-end  server  can  then  take  data  blocks  without  waiting  because  the  next  block 
of  data  is  supposedly  always  ready  within  a  minimum  amount  of  time  under  the 
current  configuration.  This  also  results  in  the  reduction  of  communication  between 
the  front-end  server  and  back-end  server  because  it  is  not  necessary  for  the  front-end 
server  to  issue  a  request  for  every  data  block  to  the  back-end  server.  The  fact  that 
the  events  may  be  accessed  in  an  arbitrary  order  also  helps  accomplish  aggressive 
prefetching. 

Random  access  via  index  files  The  data  server  maintains  index  tables  that  con- 
tain mapping  information  from  an  event  object  ID  to  its  logical  location.  The  index 
table  is  easy  to  maintain,  because  data  updates  are  append-only.  Random  access 
is  provided  based  on  the  files  of  event  object  pointers  locally  maintained  and  joined 
with  a  sequence  of  event  object  IDs  provided  by  the  analysis  jobs  at  run-time. 

Support  for  multiple  analvsis  jobs  Multiple  front-end  servers  are  necessary  to 
support  multiple  analysis  jobs.  One  way  is  to  invoke  multiple  server  processes  upon 
request  (process  approach),  while  an  alternative  is  to  implement  the  data  server  as 
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a  multi-threaded  single  process  {thread  approach).  The  process  approach  requires 
shared  memory  buffers  and  expensive  communication  overhead  between  processes. 
The  thread  approach  may  cause  portability  problem  largely  due  to  the  lack  of  stan- 
dardization between  operating  systems.  The  current  implementation  of  the  data 
server  followed  the  process  approach. 

Parallel  I/O  In  order  to  increase  throughput,  the  data  server  supports  parallel 
I/O  with  multiple  local  disks  by  assigning  one  back-end  server  to  each  disk.  To  cite 
an  example,  in  the  case  of  vertical  partitioning,  the  event  components  are  stored  into 
different  disks  and  back-end  servers  load  them  in  parallel. 

Communication  and  memory  management  The  front-end  and  the  back-end  servers 
exchange  control  messages  via  a  message  queue  that  is  implemented  with  the  System 
V  message  queue.  The  front-end  servers  obtain  disk  blocks  from  the  back-end  servers 
via  shared  memory  buffers  implemented  with  System  V  shared  memory.  The  use  of 
shared  memory  can  be  avoided  if  the  thread  approach  is  taken  to  support  for  multiple 
analysis  jobs.  Note  that  the  use  of  shared  memory  for  buffers  avoids  extra  copying 
operations  between  the  front-end  and  back-end  servers. 

5.3.5    Data  Access  Models 

New  data  access  methods  are  required  in  an  environment  where  data  sets  are 
very  large  and  updates  are  rare  and  append-only  such  as  HEP  data  set.  Conventional 
data  access  methods  basically  consist  of  a  series  of  request  and  response  for  each  block 
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of  data.  While  conventional  data  access  methods  provide  the  users  with  a  convenient 
way  of  accessing  files,  they  are  incapable  of  supporting  multiple  jobs  with  a  high 
data  access  rate  and  throughput.  However,  HEP  data  is  so  huge  and  there  exist  so 
many  users  who  want  to  access  the  data  that  new  data  access  methods  have  to  be 
introduced  to  the  system. 

Typical  HEP  analysis  job  does  not  deal  with  an  individual  event,  but  is  only 
interested  in  a  batch  of  unordered  events  in  a  particular  data  set.  Since  there  exist 
a  vast  number  of  events  in  an  HEP  data  set,  a  conventional  access  method  that 
retrieves  events  one  after  another  would  involve  with  very  frequent  communication 
between  the  data  server  and  clients  in  the  client-server  architecture,  and  hence  slow 
down  the  overall  process.  In  the  end,  it  will  result  in  a  very  poor  performance,  both 
in  terms  of  data  access  rate  and  throughput. 

The  data  access  methods  of  ASWAN  have  been  designed  to  be  as  simple  as  they 
can  be  to  achieve  the  foremost  goal  of  high  performance.  In  this  context,  the  data 
server  design  has  taken  beyond  the  conventional  access  methods  and  adopted  two 
new  access  models:  streaming  access  and  selective  access.  A  particular  data  set  is 
accessed  within  a  session,  which  is  established  between  ASWAN  and  an  analysis  job 
that  wants  to  access  the  data  set.  The  client  must  set  up  a  session  and  the  front-end 
server  works  only  within  a  session.  A  client  can  have  multiple  sessions  within  its 
analysis  job. 

Streaming  access  When  a  client  wants  to  scan  all  the  events  in  a  data  set, 
conventional  access  method  would  require  communication  for  every  block  of  events. 
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With  sheer  size  of  the  HEP  data,  however,  this  causes  a  significant  delay  and  is 
absolutely  undesirable.  Therefore,  the  event  objects  must  be  streamed  to  the  client 
so  that  the  client  can  quickly  scan  the  data  set  with  minimal  communication. 


attribute  registration 


Figure  5.3.  Streaming  access  mode 

Figure  5.3  describes  a  session  of  streaming  access  mode.  In  this  access  mode, 
the  client  specifies  a  projection  of  a  data  set  (and  possibly  a  selection  of  events)  to 
the  manager  process.  The  manager  process  then  selects  an  appropriate  data  server 
for  the  job  and  opens  a  streaming  session  between  the  client  and  the  data  server. 
Once  a  session  of  streaming  access  is  established,  the  data  server  starts  to  provide  a 
stream  of  events  from  the  specified  data  set  to  the  client.  The  client  only  waits  for 
blocks  of  events  from  the  data  server  and  never  issues  a  further  request  until  the  end 
of  stream. 
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The  streaming  access  mode  takes  full  advantage  of  event  independence.  Fur- 
thermore, since  the  client  only  needs  to  be  aware  of  the  manager  process,  the  imple- 
mentation of  data  server  becomes  transparent  to  the  clients.  The  manager  process 
can  then  allocate  the  best  resources  currently  available  in  the  system  to  the  client 
by  referring  to  the  metadatabase  server.  The  client  program  that  uses  the  streaming 
access  mode  may  look  like  the  pseudo  code  given  in  Figure  5.4. 

ClientProgramO 
{ 

Contact  MP  for  connection  to  ASWAN; 

//  give  logical  dataset  name  and  get  port  number  for  ASWAN 
Exchange  information  with  MP; 

//  connection  to  ASWAN  established  with  port  number  from  MP 
//  attributes  to  be  streamed  are  registered  here 
Begin  a  session  and  register  attributes  A,  B,  C; 

while  (next  event (A,  B,  C)  is  available) 
process  event (A,  B,  C) ; 

End  session;     //  close  the  connection  to  ASWAN 



Figure  5.4.  An  example  pseudo  code  client  program  using  streaming  access 

Selective  access  Sometimes  a  client  needs  to  access  only  those  events  that  meet 
a  particular  condition.  A  client  may  want  to  dynamically  select  the  event  objects 
based  on  the  values  of  some  attributes.  In  addition,  the  client  must  be  able  to 
dynamically  perform  projections  on  the  attributes  (also  based  on  the  values  of  other 
attributes).  Such  a  capability  is  required  to  give  the  client  freedom  to  develop  its 
program  without  restriction. 
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Let  n  be  the  total  number  of  attributes  a  client  wants  to  look  at  in  an  analysis 
job.  It  would  be  very  inefficient  to  always  get  the  events  of  n  attributes  because  the 
client  may  want  to  look  at  different  parts  of  the  event  object  based  on  the  values  of 
particular  attributes.  Therefore,  instead  of  streaming  n  attributes,  only  m  attributes, 
which  is  smaller  than  n,  can  be  streamed  to  the  client  and  the  other  n  —  m  attributes 
are  dynamically  projected  based  on  the  values  of  m  attributes.  Consequently,  the 
resultant  performance  would  be  close  to  streaming  m  attributes  rather  than  streaming 
all  n  attributes. 

We  call  the  m  attributes  being  streamed  as  the  primary  attributes  and  the  n  —  m 
attributes  that  are  dynamically  projected  based  on  the  values  of  the  m  primary 
attributes  as  the  auxiliary  attributes.  Let  Ti  be  the  time  taken  for  streaming  i 
attributes  to  the  client  and  T^_„  the  time  taken  for  selectively  accessing  m  primary 
attributes  and  n  —  m  auxiliary  attributes  where  0  <  m  <  n.  Then  what  we  want  to 
achieve  regardless  of  the  dynamic  projections  the  client  performs  is: 

Note  that  the  streaming  access  mode  is  a  special  case  of  the  selective  access  mode 
where  n  =  m.  m  must  be  much  smaller  than  n  in  order  to  get  Tm,n  very  close  to 
Tjn-  In  addition,  due  to  the  on-demand  nature  of  accessing  the  auxiliary  attributes, 
the  ratio  of  the  events  dynamically  selected  must  be  small  enough  to  have  the  above 
inequality  satisfied. 
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A  pseudo  code  for  a  client  program  that  uses  selective  access  mode  is  given 
in  Figure  5.5.  Two  attributes,  A  and  B,  are  designated  as  primary  attributes.  In 
the  while  loop,  however,  the  client  can  ask  for  auxiliary  attributes  C,  D  and  E  or, 
auxiliary  attributes  D  and  F,  and  so  on,  based  on  the  values  of  A  and  B.  If  any  of 
the  auxiliary  attributes  was  not  preregistered  but  is  asked  for  the  first  time  in  the 
while  loop,  the  attribute  will  be  automatically  registered  by  the  data  server. 


ClientPr ogram ( ) 
{ 

Contact  MP  for  connection  to  ASWAN; 

//  give  logical  dataset  name  and  get  port  number  for  ASWAN 
Exchange  information  with  MP; 

//  connection  to  ASWAN  established  with  port  number  from  MP 
//  attributes  to  be  streamed  are  registered  here 
//  some  aux  attributes  may  be  registered  here  as  well 
Begin  a  session  and  register  primary  attributes  A,  B; 

//  dynamic  event  selection/projection  is  done  here 
while  (next  event (A,  B)  is  available)  { 

if  (condKA,  B))  process  event  (A,  B:  C,  D,  E) ; 

else  if  (cond2(A,  B))  process  event(A,  B:  D,  F) ; 
else  process  event (A,  B:  C,  F,  G) ; 

} 

End  session;     //  close  the  connection  to  ASWAN 


Figure  5.5.  An  example  pseudo  code  client  program  using  selective  access 


It  may  be  noted  that  the  primary  attributes  must  be  known  to  the  data  server 
when  the  session  is  established  (pre-registraUon) .  The  auxiliary  attributes  may  be 
either  pre-registered  with  the  primary  attributes  or  dynamically  registered  when  they 
are  asked  for  the  first  time  in  the  while  loop  {self-registration).  The  self-registration 
gives  the  clients  freedom  to  develop  their  program  without  having  to  indicate  what 
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they  want  to  access  in  advance.  The  process  of  selective  access  and  self-registration 
is  described  in  Figure  5.6. 


attribute  registration* 

©    ® 


logical  names 
®   \   © 


1 .  primary  attribute  registration 

2.  metadata  for  primary  attributes 

3.  physical  info  for  primary  attributes 

4.  object  stream  (primary  attr) 

5.  aux  attr  registration  on-demand 

6.  metadata  for  aux  attributes 

7.  physical  info  for  aux  attr 

8.  auxiliary  attr  on-demand 


*  Pre-registration  of  primary  attributes  followed  by  self-registration  of  auxiliary  attributes  ON-DEMAND 


Figure  5.6.  Selective  access  mode  with  self-registration 


Two  pairs  of  front-end  and  back-end  servers  are  assigned  to  handle  the  selective 
access  mode.  FS^  is  in  charge  of  streaming  the  primary  attributes  and  its  role 
is  exactly  the  same  as  the  front-end  server  for  the  streaming  access  in  Figure  5.3. 
FSa  takes  care  of  on-demand  requests  for  the  auxiliary  attributes.  Whenever  a  new 
auxiliary  attribute  is  requested,  its  registration  is  always  directed  to  MP  and  then 
to  FSa.  FSp  keeps  streaming  the  primary  attributes  to  the  client  even  when  FSa  is 
handling  a  request  from  the  client,  and  hence  speeds  up  overall  process.  It  may  be 
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noted  that  all  the  communication  (including  the  registration  process)  involved  in  the 
selective  access  is  transparent  to  the  client. 

5.4    Performance  Evaluation 

Performance  of  a  data  server  can  be  measured  in  two  different  terms:  data  access 
rate  and  throughput.  The  data  access  rate  is  expressed  in  terms  of  the  amount  of  data 
accessed  in  a  given  time.  Representing  data  access  rate  in  terms  of  number  of  events 
provided  to  the  clients  in  a  given  time  may  be  less  accurate  and  misleading  because 
the  events  may  vary  in  size.  Throughput  is  a  measure  of  how  many  clients  can 
effectively  carry  out  their  tasks  in  a  given  time.  It  is  difficult  to  measure  throughput 
without  a  specific  standard.  Therefore,  in  this  chapter,  performance  is  measured  only 
in  terms  of  data  access  rate  and  is  given  in  MB/sec. 

This  section  is  mainly  divided  into  two  parts:  benchmarking  the  disk  access 
and  the  performance  of  ASWAN.  The  purpose  of  benchmarking  section  is  to  get  the 
baseline  disk  access  performance  and  analyze  the  behavior  of  disk  access  with  different 
patterns.  Section  5.4.2  will  present  the  performance  of  ASWAN.  Each  attribute  is 
stored  in  a  separate  file. 

The  experiments  were  carried  out  on  two  DEC  Alpha  workstations  having  the 
hostnames  nile02  and  charmOS,  both  with  Digital  Unix  OSFl  V4.0B  and  128MB 
main  memory.  nile02  is  an  Alphastation  250  with  266  MHz  21064A  Alpha  CPU,  2 
MB  cache  and  a  full  duplex  100  Mbps  network  connection.  The  data  is  stored  in  a 
single  external  9  GB  Seagate  Elite  9  disk  connected  to  a  fast  wide  differential  SCSI 
controller.  charmOS  is  an  Alphastation  500  with  333  MHz  21164A  Alpha  CPU,  4 
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MB  cache  and  a  full  duplex  100  Mbps  network  connection.  The  data  is  stored  in  two 
Seagate  Elite  9  disks  connected  to  a  Fast  Wide  SCSI  controller. 

5.4.1    Experiment  I:  Benchmarking 

Sequential  disk  access  Table  5.1  shows  the  baseline  of  disk  access  rate  for  the 
disks  used  in  the  experiments.  A  simple  program  reads  sequentially  from  a  file  of  size 
444  MB  from  the  beginning  to  the  end  with  a  specified  block  size. 

Table  5.1.  Data  access  rate  of  sequential  disk  access 


Host 
machine 

Block  size  (KB) 

16 

64 

128 

256 

nile02 

4.52 

4.52 

4.53 

4.49 

charmOS 

5.73 

5.75 

5.69 

5.74 

Alternating  disk  access  The  realistic  disk  performance  by  alternating  accesses 
several  files  is  shown  in  Table  5.2.  The  total  file  size  are  208  MB  for  2  files,  414  MB 
for  4  files,  and  658  MB  for  6  files.  Each  file  is  identical  or  similar  in  size.  A  simple 
program  reads  blocks  from  the  files  while  alternating  between  files,  i.e.,  reading  a 
block  of  data  from  one  file  and  then  the  next  block  from  another  file. 

Table  5.2.  Data  access  rate  of  alternating  disk  access  (nile02/charm03) 


Number 

Block  size  (KB) 

of  files 

16 

64 

128 

256 

2 

3.48/3.37 

3.53/3.43 

3.42/3.42 

3.41/3.42 

4 

3.25/2.99 

3.17/3.05 

3.29/3.02 

3.25/3.05 

6 

3.22/2.98 

3.24/2.98 

3.21/3.01 

3.25/3.04 
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Data  server  performance:  Real  data  set  The  performance  of  ASWAN  with  a 
real  non-uniform  data  set  is  measured  and  described  in  Table  5.3.  The  data  size 
is  the  same  as  those  of  alternating  access  except  that  an  index  data  of  size  17  MB 
needs  to  be  accessed  additionally.  The  total  number  of  events  is  80000.  ASWAN  runs 
on  either  nile02  or  charmOS.  The  data  files  are  streamed  to  a  simple  client  program 
running  in  a  different  host  machine. 

Table  5.3.  Access  rate  of  streaming  mode  on  a  real  data  set  (nile02/charra03) 


Number 

Block  size  (KB) 

of  attrs 

16 

64 

128 

2 

3.31/3.32 

4.13/4.20 

3.33/3.22 

4 

3.09/3.02 

3.09/3.00 

3.17/2.99 

6 

3.15/2.92 

3.16/2.92 

3.22/2.94 

Data  server  performance:  Svnthetic  data  set    The  performance  of  ASWAN  with 
a  synthetic  uniform  data  set  is  shown  in  Table  5.4.  The  data  size  is  143  MB  for  2 
attributes,  286  MB  for  4  attributes,  and  429  MB  for  6  attributes,  plus  an  index  file 
of  size  15  MB.  The  total  number  of  events  is  250000.  The  data  files  are  streamed  to 
a  simple  client  program  running  in  a  different  host  machine. 

Table  5.4.  Access  rates  of  streaming  mode  on  synthetic  data  sets  in  charmOS 


Number 
of  attrs 

Block  size  (KB) 

16 

64 

128 

2 

3.70 

3.70 

4.02 

4 

3.41 

3.52 

3.54 

6 

3.45 

3.50 

3.53 
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Observations  and  analysis  It  may  be  observed  that  in  all  cases,  data  access  rate 
tends  to  drop  as  the  number  of  files  (i.e.,  attributes)  increases.  Access  to  synthetic 
uniform  data  set  is  faster  as  much  as  0.5  MB/sec  (or  17%).  Finally,  Disks  in  charm03 
are  much  faster  with  sequential  access.  However,  disks  in  nile02  are  slightly  faster 
in  cases  of  alternating  access.  This  may  be  due  to  disk  fragmentation  in  charmOS 
disks. 

Based  on  the  observations  given  above,  one  can  conclude  that  the  data  access 
performance  is  not  strongly  connected  with  the  block  size.  Disk  fragmentation  can 
cause  unnecessary  seek  operations  for  every  access,  and  consequently  reduce  the  per- 
formance. The  experiment  results  show  that  ASWAN  is  capable  of  providing  data 
with  the  speed  of  sequential  disk  access  while  reconstructing  and  delivering  the  events 
to  the  remote  client.  A  smaller  number  of  files  can  result  in  a  higher  data  access  rate, 
keeping  the  size  of  components  in  each  file  nearly  identical.  Disk  striping  and  parallel 
disk  access  with  multiple  disks  can  also  improve  the  performance. 

5.4.2    Experiment  II:  Performance  of  ASWAN 

Streaming  access  mode  The  performance  of  ASWAN  in  streaming  access  mode 
is  measured  and  shown  in  Table  5.5.  The  data  size  is  225  MB  for  2  attributes,  523  MB 
for  7  attributes,  719  MB  for  12  attributes,  and  958  MB  for  17  attributes.  The  total 
number  of  events  is  80000.  The  data  files  are  streamed  to  a  simple  client  program 
running  in  a  different  host  machine.  The  test  was  performed  on  both  single-disk  and 
double-disk,  in  which  case  the  data  is  equally  assigned  to  each  disk. 


108 


Selective  access  mode  The  performance  of  ASWAN  in  selective  access  mode  is 
measured  and  shown  in  Table  5.6.  The  data  size  is  523  MB  for  7  attributes,  719  MB 
for  12  attributes,  and  958  MB  for  17  attributes,  plus  the  size  of  selected  auxiliary 
attributes.  Two  primary  attributes  are  streamed  to  a  simple  client  program  running 
in  a  different  host  machine  while  other  auxiliary  attributes  are  selectively  requested 
based  on  the  values  of  the  primary  attributes.  The  other  conditions  remain  the  same 
as  in  the  experiments  of  the  streaming  access  in  this  section.  Data  selection  is  the 
ratio  of  the  number  of  events  requested  on  demand  to  the  total  number  of  events 
(80000). 

Table  5.5.  Access  rates  of  streaming  access  mode  (nile02/charm03) 


Number 
of  attrs 

Block  size  (KB) 

16 

64 

128 

2 

3.40/5.13 

3.97/6.67 

3.32/6.59 

7 

3.15/4.45 

3.23/4.40 

3.27/4.48 

12 

3.29/5.19 

3.30/5.58 

3.29/5.38 

17 

2.88/5.45 

3.18/5.13 

3.23/5.37 

Table  5.6.  Time  in  seconds  taken  for  selective  access  with  64  KB  block  size 


Number 
of  aux  attrs 

Data  selection  (single/double  disk) 

0.1% 

1% 

5% 

10% 

5 

114.8/71.4 

184.4/106.7 

176.7/105.7 

178.2/108.4 

10 

139.0/98.4 

240.5/175.6 

240.3/182.2 

252.1/188.6 

15 

164.5/121.0 

318.0/264.3 

339.0/280.9 

407.5/417.8 
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Comparison  of  two  access  modes  Figures  5.7  to  5.10  show  how  the  streaming 
access  mode  is  compared  to  the  sequential  disk  access  patterns  described  in  Sec- 
tion 5.4.1.  In  Figure  5.7,  Seq  is  the  data  access  rate  of  sequential  disk  access,  and 
Alt2  is  of  alternating  disk  access  with  2  files,  both  with  64  KB  block  size  in  nile02. 
Stream2  and  Streaml/  (dStream2  and  dStreamlT)  are  data  access  rates  of  streaming 
access  mode  with  2  and  17  attributes  with  a  single  disk  (two  disks),  respectively. 


a: 

§5 


Seq 
Alt2 

Stream2 
Streaml  7 
dStream2 
dStreamI? 


Disk  buffer  size  (KB) 

Figure  5.7.  Comparison  of  streaming  access  to  simple  disk  access  patterns 


It  is  obvious  that  the  streaming  access  mode  always  performs  well  regardless 
of  the  number  of  attributes.  Its  performance  is  very  close  to  or  above  that  of  the 
simplest  disk  access  patterns  such  as  sequential  access  and  alternating  access  with  2 
files. 
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Figure  5.8.  Selective  access  vs.  streaming  access  (m  =  2,  n 
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Figure  5.9.  Selective  access  vs.  streaming  access  (m  =  2,  n  = 
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Figure  5.10.  Selective  access  vs.  streaming  access  (m  =  2,  n  =  17) 


In  Figures  5.8,  5.9  and  5.10,  the  performance  of  selective  access  mode  is  com- 
pared to  that  of  its  counterpart,  the  streaming  access  mode.  In  Figure  5.8,  for  exam- 
ple, the  performance  of  selective  access  for  5  auxiliary  attributes  is  compared  with 
that  of  streaming  2  and  7  attributes  respectively  (as  shown  in  Table  5.5).  The  dotted 
line  on  the  bottom  and  the  straight  line  on  the  top  show  the  lower  and  upper  bounds 
of  the  selective  access  for  5  auxiliary  attributes.  The  inequality  T2  <  T2-a  <  T7  is 
satisfied  in  some  cases,  but  not  in  other  cases.  There  may  be  two  reasons  for  this. 
One  is  that  the  selective  access  mode  is  inherently  slow  by  its  nature  due  to  random 
requests  from  the  client  based  on  the  values  of  primary  attributes.  The  other  is  that 
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the  streaming  access  mode  performs  well  enough  that  it  is  difficult  to  exceed  its  per- 
formance by  any  means.  The  optimization  to  speed  up  the  selective  access  is  still  in 
progress. 

5.5  Summary 

This  chapter  describes  the  design  and  implementation  of  a  high-performance 
data  server,  named  ASWAN,  which  is  tailored  to  the  HEP  data  access  patterns,  and 
measures  its  performance  with  several  experiments.  ASWAN  supports  the  needs  of 
clients  with  two  different  data  access  modes.  The  streaming  access  mode  provides  a 
fast  data  access  for  the  clients  who  want  to  scan  an  entire  data  set  very  quickly.  On 
the  other  hand,  the  selective  access  mode  gives  the  clients  freedom  to  develop  their 
program  without  having  to  worry  about  what  to  access  in  advance  with  relatively 
high  performance.  The  other  advantage  of  selective  access  mode  is  that  it  allows 
clients  to  save  the  network  bandwidth  for  other  concurrent  clients  in  the  system  and 
therefore  increases  the  overall  system  throughput. 

The  selective  access  mode  is  still  being  optimized  to  enhance  its  performance. 
Work  on  the  client  interface  of  ASWAN  is  needed  in  order  for  the  client  to  practically 
use  ASWAN  to  access  the  HEP  data  set.  The  data  set  must  be  reorganized  and 
maintained  in  the  metadatabase  for  easy  and  consistent  data  management.  In  the 
future,  ASWAN  needs  to  be  improved  to  handle  the  data  in  tapes  since  most  of  the 
HEP  data  is  stored  in  tapes. 


CHAPTER  6 

AUTOMATED  WEB  RATING  SYSTEM  USING  CLASSIFICATION 

6.1  Introduction 

The  Internet  not  only  creates  new  business  opportunities  but  also  brings  new 
markets  to  the  existing  businesses.  The  number  of  Websites  is  explosively  growing 
on  the  Internet,  and  a  numerous  Websites  are  being  created  while  a  number  of  others 
disappear  at  any  given  moment.  The  explosive  increase  in  the  number  of  Websites  and 
their  instability  have  brought  great  confusion  to  the  users  and  become  a  critical  issue 
recently,  especially  in  the  e-commerce  arena.  This  establishes  the  need  for  objective 
appraisal  of  the  commercial  Websites  based  on  various  aspects  of  the  businesses. 

There  already  exist  many  Websites  that  try  to  rate  online  retailing  Websites. 
However,  their  evaluation  method  of  the  Websites  is  based  on  one  dimension,  or  two 
at  best,  and  therefore  does  not  correctly  reflect  the  true  value  of  the  Websites  from 
the  customers'  points  of  view.  The  dominant  factor  to  evaluate  the  online  retailing 
Websites  is  the  price  of  the  goods.  Another  factor  is  the  customers'  opinion.  However, 
the  price  alone  cannot  determine  the  goodness  of  the  business,  and  the  customers' 
opinion  can  be  easily  biased  and  may  mislead  the  silent  majority. 
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Under  these  circumstances,  objective  appraisal  of  the  innumerable  Websites 
based  on  various  aspects  is  immensely  desired  and  will  alleviate  the  users'  difficulty 
in  choosing  the  right  ones  that  fit  in  their  standards  and  requirements.  However,  it 
is  practically  impossible  to  manually  evaluate  the  Websites  due  to  their  uncountable 
population.  A  model  needs  to  be  built  in  order  to  consistently  evaluate  the  Websites 
based  on  various  aspects  of  the  respective  business  areas.  On  the  other  hand,  the 
characteristics  of  the  Web  itself  hinder  the  automation  of  such  task.  In  general,  the 
Web  data  is  unstructured,  or  semi-structured  at  best,  which  makes  it  difficult  to 
collect  and  analyze  the  data  in  it.  Each  Website  is  uniquely  designed  and  consists 
of  various  data  formats  such  as  text,  image,  and  sound.  The  last,  but  not  the  least, 
feature  of  the  Web  data  is  that  the  Web  is  evolving  dynamically.  The  contents  of  the 
Web  continuously  change,  and  the  previously  rated  Websites  as  of  today  may  become 
obsolete  in  only  a  few  days. 

Classification  can  be  well  suited  for  such  task  since  it  considers  many  different 
kinds  of  attributes  in  building  a  model  of  classification.  Note,  however,  that  the 
traditional  techniques  are  inherently  inadequate  for  such  dynamic  non-uniform  data 
with  various  formats  which  grows  and  changes  quickly  on  the  Internet.  We  introduce 
a  software  system  to  appraise  the  commercial  Websites  based  on  various  aspects  as 
an  application  of  the  subspace  identification  and  incremental  classification  techniques 
presented  in  the  previous  chapters.  The  system  automatically  performs  the  process 
of  data  collection,  construction  of  the  classification  model,  and  classification  of  the 
individual  Websites  based  on  the  model.  The  automated  process  can  greatly  reduce 
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the  use  of  human  resources  and  minimize  the  individual  inconsistency  and  errors 
that  would  occur  when  human  beings  are  involved  in  such  task.  This  system  is  being 
successfully  deployed  at  an  Internet  business  firm  that  tries  to  appraise  the  online 
merchants  [72,  73]. 

The  rating  system  is  composed  of  two  stages:  the  first  which  builds  the  clas- 
sification model  (the  rating  engine),  and  the  second  which  applies  the  unclassified 
individual  Websites  to  the  model.  Each  stage  will  be  described  in  Section  6.2,  and 
the  data  collection  process  is  also  presented  in  the  context  of  these  two  stages.  The 
design  of  the  system  and  its  components  are  briefly  introduced  in  Section  6.3. 

6.2    The  Stages  of  the  Web  Rating  System 

The  entire  system  is  composed  of  two  stages:  one  to  build  the  rating  engine, 
and  the  second  to  rate  the  Websites  whose  rating  is  unknown.  Each  stage  involves 
a  data  collection  process,  which  essentially  serves  the  same  function  in  both  stages. 
The  flow  of  the  system  is  presented  in  Figure  6.1. 

6.2.1    Stage  I:  Building  the  Rating  Engine 

Selection  of  attributes  The  attributes  of  the  Websites  are  distinguished  into 
two  categories.  The  semantic  attributes  contain  the  information  that  are  semantically 
related  to  the  goodness  of  the  site  being  rated.  When  online  merchants  are  appraised, 
for  instance,  the  semantic  information  would  include  the  number  of  times  that  the 
server  of  the  site  has  been  down  for  the  last  3  months,  the  number  of  contact  points 
(email,  telephone,  fax,  etc),  whether  secure  protocol  is  used  for  ordering,  whether  the 
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refund/return  policies  are  clearly  specified,  and  whether  the  orders  can  be  cancelled. 
The  information  listed  above  can  be  used  to  semantically  describe  the  site  being 
rated. 

The  non-semantic  attributes  are  those  that  do  not  seem  intuitively  related  to 
the  goodness  of  the  sites  being  appraised.  The  examples  include  the  average  size  of 
the  pages  in  the  site,  the  business  classification  (e.g.,  consumer  electronics,  music, 
auction),  and  the  average  number  of  images  contained  in  each  page.  This  kind 
of  information  may  contain  inherent  correlations  between  the  Websites  and  their 
goodness,  which  are  invisible  to  human  beings.  However,  application  of  classification 
technique  to  such  information  may  lead  us  to  the  discovery  of  the  hidden  correlations. 


Attribute 
Selection 


Attribute 
Extraction 


Subspace 
Identification 


Building 
Rating  Model 


Stage  1 
Stage  2 


Figure  6.1.  The  steps  of  the  Web  rating  system 


The  attribute  selection  is  a  process  which  identifies  and  extracts  both  the  se- 
mantic and  non-semantic  information  from  the  Websites  to  be  appraised.  Since  the 
hidden  correlations  cannot  be  known  beforehand,  this  process  must  extract  as  many 
such  attributes  as  possible. 
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Attribute  extraction  As  described  in  Section  2.1,  each  record  in  the  training 
set  is  required  to  include  a  class  value.  It  is  easy  to  assign  the  class  values  to  the  well- 
known  Websites  whose  classification  can  be  agreed  upon  by  common  sense.  Another 
method  is  to  have  experts  to  manually  grade  the  Websites  that  are  to  be  included  in 
the  training  set. 

Once  the  attributes  are  chosen  and  the  class  values  available,  an  intelligent  Web 
agent  explores  the  Web  to  extract  these  attributes  automatically  and  store  them  in 
a  database.  Since  each  Website  has  unique  design  and  contents,  the  Web  agent  is 
required  to  have  intelligence  to  distinguish  such  uniqueness  in  order  to  extract  appro- 
priate information.  The  Web  agent  technology  is  out  of  the  scope  of  this  dissertation 
and  thus  is  not  presented  in  further  detail. 

Subspace  identification  The  number  of  attributes  that  are  selected  and  ex- 
tracted from  the  Websites  may  exceed  a  few  hundreds  because  the  non-semantic 
attributes  are  determined  somewhat  blindly  by  the  reason  stated  above.  Such  blindly 
selected  attributes  can  incur  unnecessary  complexity  and  high  cost  to  the  system. 
Therefore,  the  extracted  attributes  are  simplified  by  using  the  subspace  identifica- 
tion technique  given  in  Chapter  3. 

Construction  of  the  rating  engine  After  the  meaningful  subspace  is  identified 
in  the  previous  stage,  only  those  identified  are  used  to  build  the  rating  engine.  This 
process  that  ranges  from  the  selection  of  attributes  to  construction  of  the  rating 
engine  needs  to  be  repeated  over  time  in  order  to  keep  the  rating  engine  up-to-date. 
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The  incremental  classification  technique  presented  in  Chapter  4  may  be  applied 
to  maintain  the  quality  of  the  rating  engine  as  the  Web  is  constantly  changing.  At  any 
given  time,  the  intelligent  Web  agent  may  be  executed  to  collect  a  new  incremental 
set  of  records  or  simply  a  new  set  of  records  for  the  same  set  of  Websites.  As  time 
goes  on,  the  data  from  old  epochs  are  disposed  of,  and  new  partitions  are  added  to 
build  an  up-to-date  rating  engine. 
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Figure  6.2.  The  architecture  of  the  Web  rating  system 


6.2.2    Stage  II:  Rating  the  Websites 

Using  the  constructed  rating  engine,  the  unknown  Websites  can  now  be  ap- 
praised. The  attributes  to  be  used  in  this  rating  stage  are  extracted  the  same  way  as 
the  training  data  is  obtained.  At  this  time,  however,  only  those  attributes  identified 
to  be  meaningful  are  extracted  for  appraisal  of  the  Websites  whose  class  is  unknown. 
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6.3    The  Architecture  of  the  Rating  System 

The  architecture  of  the  rating  system  is  depicted  in  Figure  6.2.  The  intelligent 
Web  agent  explores  the  Web  to  extract  the  selected  attributes  from  the  specified 
Websites.  The  output  of  the  Web  agent  is  integrated  by  the  attribute  integration 
module  (AIM)  and  stored  in  the  database  DBt-  The  subspace  identification  module 
(SIM)  identifies  the  meaningful  subspace  of  the  entire  data  space  obtained  by  the  Web 
agent.  The  output  of  the  SIM  is  essentially  used  as  the  training  set  for  construction 
of  the  Web  rating  engine  (WRE).  The  identified  subspace  is  fed  back  to  the  AIM  so 
that  the  AIM  can  direct  its  output  into  DBn  for  the  new  unknown  Websites  during 
the  rating  stage.  Note  that  the  system  seamlessly  combines  the  construction  stage 
and  the  rating  stage. 

6.4  Summary 

The  presented  rating  system  is  developed  to  automatically  appraise  the  Websites 
on  the  Internet  which  grows  explosively  and  changes  dynamically.  The  Web  data  is 
difficult  to  collect  and  analyze  due  to  its  heterogeneity  and  unstructuredness.  The 
rating  system  performs  the  task  of  data  collection,  construction  of  the  classification 
model,  and  application  of  the  individual  Websites  to  the  model  automatically.  It 
shows  that  the  research  results  presented  in  this  dissertation  can  be  practically  applied 
to  the  areas  such  as  the  World  Wide  Web.  Because  the  proposed  rating  system  must 
evolve  accordingly  as  the  Web  changes  constantly,  the  quality  metrics  of  the  rating 
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model  are  not  given  in  this  thesis.  More  information  can  be  found  in  the  company's 
Website^  J 


.1 


^The  URL  of  the  company  is  http://www.essue.com  and  http://www.essue.co.kr. 


CHAPTER  7 
CONCLUSION 


Due  to  the  advances  in  data  storage  technologies  and  advent  of  large  scale 
business  and  scientific  applications,  the  data  for  data  mining  has  become  very  large, 
dynamically  growing,  sparse,  and  high-dimensional.  One  remarkable  example  is  the 
explosive  growth  of  Internet  where  huge  amount  of  high-dimensional  data  is  produced 
every  day.  Traditional  data  mining  techniques  falter  when  they  are  applied  to  such 
data,  and  therefore,  new  data  mining  techniques  need  to  be  developed  in  order  to  deal 
with  the  modern  data  with  such  characteristics.  In  this  dissertation,  we  introduce 
algorithms  and  software  for  efficient  data  mining  that  address  such  characteristics  of 
the  modern  data. 

The  primary  problems  that  are  tackled  in  this  dissertation  include  identifica- 
tion of  meaningful  subspaces  for  efficient  data  mining,  incremental  classification  for 
large,  ever-growing  data,  and  efficient  data  access  methods  that  are  designed  to  meet 
the  application  requirements  for  easy  access  to  distributed  data  on  heterogeneous 
systems.  The  dissertation  presents  a  software  system  that  appraises  the  Websites 
as  an  application  of  the  algorithms  developed  for  the  subspace  identification  and 
incremental  classification. 
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We  take  a  density-based  approach  [7]  to  identification  of  meaningful  subspaces 
of  high-dimensional  data  [8].  To  approximate  the  density  of  the  data  points,  each 
dimension  of  the  data  space  is  partitioned  into  equal-length  intervals.  This  means 
that  each  unit  has  the  same  volume,  and  therefore  the  number  of  points  inside  each 
cell  of  all  subspaces  can  be  used  to  approximate  the  density  of  the  cells. 

Two  algorithms  are  proposed  for  subspace  identification:  one  that  adopts  the 
association  rule  mining  (named  SubAssoc)  and  the  other  is  a  variation  of  the  SubAssoc 
(named  DataScan).  The  SubAssoc  algorithm  works  in  a  bottom-up  fashion  to  exploit 
the  monotonicity  of  the  density  criterion  with  respect  to  dimensionality  to  prune  the 
search  space,  which  is  similar  to  the  Apriori  algorithm  [5].  The  DataScan  algorithm  is 
a  simplified  version  of  the  SubAssoc  algorithm  which  is  restricted  to  find  only  large 
2-itemsets  during  a  scan  of  the  dataset. 

We  discuss  how  to  select  an  optimal  subspace  out  of  a  set  of  many  meaningful 
subspaces  identified.  In  order  to  eliminate  the  support  count  (density)  parameter, 
two  alternatives  are  suggested  as  future  research.  One  is  to  apply  a  hypergraph 
partitioning  algorithm  [9]  such  that  a  hypergraph  is  formed  from  the  large  itemsets 
found  by  an  association  rule  algorithm.  After  the  hypergraph  partitioning  algorithm 
is  applied  to  the  hypergraph,  the  set  of  vertices  with  the  best  connectivity  is  chosen  as 
the  most  meaningful  subspace.  The  x-squared  test  [10]  can  be  applied  to  wide  range 
of  problems  involving  correlations.  Basically,  the  significance  of  a  subspace  may  be 
interpreted  as  how  much  the  composing  attributes  are  correlated  one  another. 
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The  efficiency  and  effectiveness  of  the  proposed  subspace  identification  algo- 
rithms are  empirically  proved  on  two  popular  data  mining  operations:  classification 
and  clustering.  The  experiments  also  show  that  the  algorithms  are  scalable  with  the 
dimensionality  as  well  as  the  size  of  the  data. 

The  problem  of  large,  ever-growing,  and  distributed  data  is  addressed  in  the 
context  of  classification  in  this  research.  We  introduce  an  efficient  method  called 
ICE  for  incremental  classification  on  large,  ever-growing  data  that  employs  tree-based 
sampling  techniques  [11].  The  features  of  ICE  include  scalability,  easy  parallelization, 
inherent  migration  into  distributed  environment,  and  flexibility. 

This  method  is  based  on  weighted  samples  and  is  independent  of  data  distri- 
bution because  the  data  distribution  is  implicitly  reflected  into  the  structure  of  the 
decision  trees  from  which  the  weighted  samples  are  extracted.  The  sampling  tech- 
niques are  integrated  with  clustering  operations  based  on  weighted  sampling.  The 
basic  idea  is  to  represent  the  class  distribution  in  the  dataset  by  using  the  weighted 
samples  which  are  extracted  from  the  nodes  of  intermediate  decision  trees  using  a 
clustering  technique.  As  the  data  grows,  an  intermediate  classifler  is  built  only  on 
the  incremental  portion  of  the  data.  The  weighted  samples  from  the  intermediate 
classifier  built  only  on  the  incremental  data  are  combined  with  the  previously  gen- 
erated samples  to  obtain  an  up-to-date  classifier  for  the  entire  data  in  an  efficient, 
incremental  fashion.  The  experimental  results  show  that  the  method  combined  with 
the  sampling  techniques  promises  to  be  a  basis  of  incremental  classification  for  large, 
ever-growing  data. 


124 


The  ICE  can  be  improved  by  developing  an  efficient  metfiod  that  discovers  an 
appropriate  sampling  technique  for  a  given  dataset,  and  new  sampling  techniques 
that  adopt  incremental  clustering  technique  [62].  The  feature  selection  information 
from  the  current  decision  tree  classifier  can  be  used  to  optimize  the  induction  of  a 
new  classifier  when  a  new  incremental  partition  is  added.  We  may  also  consider  the 
time  that  a  record  was  created  in  updating  the  decision  tree  classifier.  A  new  weight 
assignment  scheme  needs  to  be  modified  based  on  this  new  parameter.  It  is  also 
desirable  to  develop  a  scheme  that  assigns  weights  to  categorical  attributes,  rather 
than  just  choosing  the  majority  value. 

Data  mining  applications  usually  deal  with  large,  distributed  data,  and  require 
efficient  data  access  methods  and  easy  user  interface.  We  present  two  data  models 
for  efficient  data  mining  operations,  and  introduce  an  advanced  data  server  called 
ASWAN  that  facilitates  fast  and  efficient  access  to  distributed  data  [12,  13].  The 
server  implements  the  two  data  access  models:  the  streaming  access  mode  which 
guarantees  the  fastest  access  to  the  entire  data,  and  the  selective  access  mode  that 
allows  users  to  access  part  of  the  data  on  demand  very  quickly.  The  server  provides 
an  easy  programming  interface,  and  its  performance  is  evaluated  in  the  context  of 
high  energy  physics  applications. 

The  Internet  is  one  of  the  areas  that  grows  very  quickly  and  changes  continu- 
ously. The  classified  objects  today  may  not  have  the  same  classification  any  more 
after  a  few  days.  In  these  applications,  it  is  essential  to  develop  a  new  technique  that 
is  capable  of  generating  prototypes  quickly  and  dealing  efficiently  with  incremental 
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data.  As  an  application  of  the  subspace  identification  and  incremental  classification 
techniques,  we  introduce  a  software  system  that  automatically  rates  the  Websites 
based  on  the  data  collected  by  an  intelligent  Web  agent.  The  system  has  been  suc- 
cessfully applied  to  a  practical  Internet  business  whose  major  operation  is  to  appraise 
the  Websites  and  assign  an  appropriate  classification  [72,  73]. 
Finally,  the  contributions  of  this  research  are  the  following: 

•  It  provides  efficient  algorithms  that  identify  the  meaningful  subspaces  of  sparse, 
high-dimensional  data,  on  which  the  target  data  mining  operations  can  be  per- 
formed more  effectively. 

•  The  discovered  subspaces  can  result  in  a  more  compact  representation  of  the 
data  and  simplify  the  data  collection  process. 

•  It  introduces  an  efficient  method  to  incrementally  classify  large,  ever-growing 
data  by  employing  the  tree-based  sampling  techniques. 

•  It  presents  two  data  models  and  an  implementation  that  facilitate  efficient 
access  methods  to  large,  distributed  data  for  data  mining. 

•  It  presents  the  first  attempt  to  automatic  appraisal  of  the  Websites  based  on 
all  aspects  that  are  collected  by  an  intelligent  Web  agent. 
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